Using Python 3.6 to transform dirty data from an Access database


When I was 19 or 20, my dad typed up records of British Army officers in World War 1 into an Access 1.0 database. That was 27 years ago; once he remembered he had done it, he decided I should "put it on the internet" for him (I had been drinking and agreed).

The good news is that it's a single table. The bad news is that data typed in 1990 has not stood the test of time. There is a lot of abbreviation, and different spellings of things that should be the same spoil the document and make searching difficult.

The CSV header looks like:

soldier,soldiername,notes,christian name,rank,regiment,appointment,grade,date appointed 

The data looks like:

2876,"furber","vice capt g w r stacpoole dso res of off lg29113 2984 26/3/1915.   replaced lt h gordon ind army res of off lg29555 4121 20/4/1916.","m","capt","re 

See how the word "captain" is spelt "capt", "cptn", "cpt", and so on.

I'd like to process the file and correct these. I'm using Python 3.6 to read in the CSV and substitute the ranks using a dictionary whose keys are the abbreviations and whose values are the corrected ranks:

correct_ranks = {
    'cap'                : 'captain',
    'temp cap'           : 'temporary captain',
    'temp capt'          : 'temporary captain',
    'temp capt act maj'  : 'temporary captain acting major',
    'temp capt temp maj' : 'temporary captain temporary major',
    'temp capt hon maj'  : 'temporary captain honorable major',
    'temp capt hon'      : 'temporary captain honorable',
    'capt brevet lt col' : 'captain brevet lieutenant colonel',
    'capt earl of'       : 'captain earl of',
    'capt hon'           : 'captain honorable',
    'hon capt'           : 'captain honorable',
}
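A minimal sketch of how such a dictionary might be applied to the `rank` column while reading the CSV. The helper names `normalise_rank` and `clean_rows`, the three-entry mapping, and the sample rows are all hypothetical and only there for illustration; the real file has more columns and the real mapping is much longer.

```python
import csv
import io

# Illustrative subset of the mapping; the 'capt' entry is an assumption.
rank_map = {
    'cap': 'captain',
    'capt': 'captain',
    'temp capt': 'temporary captain',
}

def normalise_rank(raw, mapping):
    """Lower-case, collapse whitespace, then look the value up in the mapping.

    Values with no entry are returned cleaned but otherwise unchanged.
    """
    key = ' '.join(raw.lower().split())
    return mapping.get(key, key)

def clean_rows(in_file, mapping):
    """Yield each CSV row with its 'rank' column corrected."""
    for row in csv.DictReader(in_file):
        row['rank'] = normalise_rank(row['rank'], mapping)
        yield row

# Made-up rows standing in for the exported file.
sample_csv = io.StringIO(
    'soldier,soldiername,rank\n'
    '1,"example one","Temp  Capt"\n'
    '2,"example two","lt"\n'
)
cleaned_rows = list(clean_rows(sample_csv, rank_map))
```

Looking up the whole cleaned value, rather than substituting inside it, keeps multi-word entries such as 'temp capt' from being half-converted by a shorter key.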

I've got a similar job to do for the regiments they belong to. My earliest solution was to use a list of regexes:

s = re.sub(r'\bdrgns\b', 'dragoons', s)
s = re.sub(r'\bdgn\b', 'dragoon', s)
s = re.sub(r'\bgds\b', 'guards', s)
s = re.sub(r'\binf\b', 'infantry', s)
s = re.sub(r'\bbde\b', 'brigade', s)
s = re.sub(r'\bhsld\b', 'household', s)
s = re.sub(r'\bres\b', 'reserve', s)
s = re.sub(r'\byeo\b', 'yeomanry', s)
s = re.sub(r'\bco of\b', 'county of', s)
s = re.sub(r'\bco\b', 'county', s)
s = re.sub(r'\blon(d?)\b', 'london', s)
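One way the same substitutions could be made data-driven is to keep the abbreviations in an ordered list and loop over it. This is only a sketch: `REGIMENT_SUBS` and `expand_regiment` are hypothetical names, and the function assumes the input is already lower-case. Because Python strings are immutable, the function has to return the new string and the caller has to store it.

```python
import re

# (abbreviation pattern, expansion) pairs taken from the substitutions above.
# Order matters: 'co of' has to be tried before the shorter 'co'.
REGIMENT_SUBS = [
    (r'drgns', 'dragoons'),
    (r'dgn', 'dragoon'),
    (r'gds', 'guards'),
    (r'inf', 'infantry'),
    (r'bde', 'brigade'),
    (r'hsld', 'household'),
    (r'res', 'reserve'),
    (r'yeo', 'yeomanry'),
    (r'co of', 'county of'),
    (r'co', 'county'),
    (r'lond?', 'london'),
]

# Compile once, wrapping each pattern in word boundaries.
COMPILED_SUBS = [
    (re.compile(r'\b' + pattern + r'\b'), replacement)
    for pattern, replacement in REGIMENT_SUBS
]

def expand_regiment(text):
    """Apply every substitution in order and return the expanded string."""
    for pattern, replacement in COMPILED_SUBS:
        text = pattern.sub(replacement, text)
    return text

expanded = expand_regiment('hsld bde gds')
```

Adding a new abbreviation then means adding one tuple to the list rather than another line of code.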

I've got this working pretty OK, but the string doesn't seem to be transformed at the end, and the solution seems "naive". My question:

What are the ways of approaching this task? (I can program a little in Python, and I've got the steps for extracting from the Access MDB and publishing to a Google Sheet working.) Any suggestions on strategies for cleanup and tidying would help.
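As one possible starting point for the cleanup, the distinct values in a column could be counted so that the most frequent spellings without a mapping show up first. This is a sketch under assumptions: `unmapped_values` is a hypothetical helper, and the sample data is made up.

```python
import csv
import io
from collections import Counter

def unmapped_values(in_file, column, known):
    """Count cleaned values in `column` that are not in `known`.

    Returns (value, count) pairs, most frequent first.
    """
    counts = Counter()
    for row in csv.DictReader(in_file):
        value = ' '.join(row[column].lower().split())
        if value and value not in known:
            counts[value] += 1
    return counts.most_common()

# Made-up single-column sample; the real file has more columns.
audit_csv = io.StringIO('rank\ncapt\ncptn\ncapt\ncaptain\n')
audit_report = unmapped_values(audit_csv, 'rank', known={'captain'})
```

Running this repeatedly while extending the mapping gives a rough measure of how much of the column is still uncorrected.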

