python - Pandas - DateTime within X amount of minutes from row -
i not entirely positive best way ask or phrase question highlight problem, dataset, thoughts on method , end goal , clear end.
my problem:
my company dispatches workers , load dispatches single employee if on current dispatch. due limitation in software use. if employee receives 2 dispatches within 30 minutes, call double dispatch.
we analyzing our dispatching efficiency , running bit of head scratcher. need run through our 100k row database , add additional column read dummy variable 1 double 0 normal. have multiple people dispatch , b our records not start ordered dispatch, need determine how dispatch occurs same person within 30 minutes.
dataset:
the dataset incredibly massive due poor organization in our data warehouse terms of items need these columns need calc.
tech name | dispatch time (pst) john smith | 1/1/2017 12:34 jane smith | 1/1/2017 12:46 john smith | 1/1/2017 18:32 john smith | 1/1/2017 18:50
my thoughts: how clunky , work 1 way not backwards. more or less write code as:
import pandas pd df = pd.read_excel('data.xlsx') df.sort('dispatch time (pst)', inplace = true) tech_name = none dispatch_time = pd.to_datetime('1/1/1900 00:00:00') index, row in df.iterrows(): if tech_name none: tech_name = row['tech name'] else: if dispatch_time.pd.time_delta('0 days 00:30:00') > row['tech dispatch time (pst)'] , row['tech name'] = tech_name: row['double dispatch'] = 1 dispatch_time = row['tech dispatch time (pst)'] else: dispatch_time = row['tech dispatch time (pst)'] tech_name = row['tech name']
this has many problems being slow, tracking dates going backwards , not forwards missing many dispatches.
end goal:
my goal have dataset can plug tableau report adding on 1 column reads dummy variable can filter , calculate on that.
i appreciate time , , let me know if more details necessary.
thank you!
------------------ edit ------------- added edit make question clear failed earlier.
question: pandas best tool use iterate on dataframe see each each datetime dispatch, there record matches tech's name , less 30 minutes away record.
if so, how improve algorithm or theory, if not best tool be.
desired output - additional column records if dispatch happened within 30 minute window dummy variable 1 true 0 false. need see when double dispatches occuring , how many records true double dispatches, , not count says there 100 instances of double dispatch, involved on 200 records. need able sort , see each record.
hello think found solution. slow, compares 1 index before or after, in terms of cases have 3 dispatches within thirty minutes, represents less .5 % us.
import pandas pd import numpy np import datetime dt dispatch = 'tech dispatched date-time (pst)' tech = 'combinedtech' df = pd.read_excel('combined_data.xlsx') df.sort_values(dispatch, inplace=true) df.reset_index(inplace = true) df['double dispatch'] = np.nan writer = pd.excelwriter('final_output.xlsx', engine='xlsxwriter') dispatch_count = 0 time = dt.timedelta(minutes = 30) index, row in df.iterrows(): try: tech_one = df[tech].loc[(index - 1)] dispatch_one = df[dispatch].loc[(index - 1)] except keyerror: tech_one = none dispatch_one = pd.to_datetime('1/1/1990 00:00:00') try: tech_two = df[tech].loc[(index + 1)] dispatch_two = df[dispatch].loc[(index + 1)] except keyerror: tech_two = none dispatch_two = pd.to_datetime('1/1/2020 00:00:00') first_time = dispatch_one + time second_time = pd.to_datetime(row[dispatch]) + time dispatch_pd = pd.to_datetime(row[dispatch]) if tech_one == row[tech] or tech_two == row[tech]: if first_time > row[dispatch] or second_time > dispatch_two: df.set_value(index, 'double dispatch', 1) dispatch_count += 1 else: df.set_value(index, 'double dispatch', 0) dispatch_count += 1 print(dispatch_count) # monitor total # of records being pushed through df.to_excel(writer,sheet_name='sheet1') writer.save() writer.close()
Comments
Post a Comment