r - Extract string and its location using dplyr/tidyr approach -

the input data frame has 3 id columns , 1 raw_text. u_id corresponds user, doc_id corresponds document of particular user , sentence id corresponds sentence within document of user.

df <- data.frame(u_id=c(1,1,1,1,1,2,2,2),                  doc_id=c(1,1,1,2,2,1,1,2),                  sent_id=c(1,2,3,1,2,1,2,1),                  text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",                                "pertinent results: 2105-4-16 05:02pm gap-14                                 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",                                "method exists , former because calls corresponding",                         "admission date: 2001-4-19 discharge date: 2002-5-23 service:",                         "pertinent results: 2105-4-16 05:02pm gap-14                          2105-4-16 04:23pm rdw-13.1 2105-4-16 .",                         "method exists , former because calls corresponding",                         "method exists , former because calls corresponding",                         "method exists , former because calls corresponding"))

let's assume need extract dates , location raw_text. approach far -

#define regex date date<-"([0-9]{2,4})[- . /]([0-9]{1,4})[- . /]([0-9]{2,4})"  #library library(dplyr) library(stringr)  #extract dates df_i<-df %>%    mutate(i=str_extract_all(text,date)) %>%    mutate(date=lapply(i, function(x) if(identical(x, character(0))) na_character_ else x)) %>%    unnest(date)  #extract date locations df_ii<-str_locate_all(df$text,date) n<-max(sapply(df_ii, nrow)) date_loc<-as.data.frame(do.call(rbind, lapply(df_ii, function (x)    rbind(x, matrix(, n-nrow(x), ncol(x))))))

the date extractions in data.frame format. there approach put string_locations in data.frame format corresponding id , string? ideally, output should -

output<-data.frame(id=c(1,1,2,2,3),                text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:",                       "admission date: 2001-4-19 discharge date: 2002-5-23 service:",                       "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",                       "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .",                       "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 ."),                date=c("2001-4-19","2002-5-23","2105-4-16","2105-4-16","13.1 2105"),                date_start=c(17,43,20,74,96),                date_end=c(25,51,28,82,104))

you can this:

regex = "\\b[0-9]+[-][0-9]+[-][0-9]+\\b" df_i = str_extract_all(df$text, regex)  df_ii = str_locate_all(df$text, regex)   output1 = map(function(x, y, z){   if(length(y) == 0){     y = na   }   if(nrow(z) == 0){     z = rbind(z, list(start = na, end = na))   }   data.frame(id = x, date = y, z) }, df$id, df_i, df_ii) %>%   do.call(rbind,.) %>%   merge(df, .)

or stick piping-only syntax:

regex = "[0-9]+[-][0-9]+[-][0-9]+"  output1 = df %>%   {list(.$id, str_extract_all(.$text, regex),         str_locate_all(.$text, regex))} %>%   {map(function(x, y, z){     if(length(y) == 0){       y = na     }     if(nrow(z) == 0){       z = rbind(z, list(start = na, end = na))     }     data.frame(id = x, date = y, z)   }, .[[1]], .[[2]], .[[3]])} %>%   do.call(rbind, .) %>%   merge(df, .)

result:

  id 1  1 2  1 3  2 4  2 5  2 6  3                                                                                                                  text 1                                                        admission date: 2001-4-19 discharge date: 2002-5-23 service: 2                                                        admission date: 2001-4-19 discharge date: 2002-5-23 service: 3 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 . 4 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 . 5 pertinent results: 2105-4-16 05:02pm gap-14 \n                               2105-4-16 04:23pm rdw-13.1 2105-4-16 . 6                                                     method exists , former because calls corresponding        date start end 1 2001-4-19    17  25 2 2002-5-23    43  51 3 2105-4-16    20  28 4 2105-4-16    77  85 5 2105-4-16   104 112 6      <na>    na  na

notes:

your regular expression incorrectly extracts "13.1" "rdw-13.1 2105-4-16" because added spaces in [- . /]. date<-"([0-9]{2,4})[-./]([0-9]{1,4})[-./]([0-9]{2,4})" should it.
mutate allows use variable have created inside same function call, there no need use 2 separate mutate's df_i.
for pipping-only solution, {} needed around list() , map() override dplyr default of feeding in output step first argument of next function.

for instance:

df %>%       list(.$id, str_extract_all(.$text, regex),                   str_locate_all(.$text, regex))

becomes:

list(df, df$id, str_extract_all(df$text, regex),                  str_locate_all(df$text, regex))

which not want.

edits:

op updated df include rows text not include dates. cause original solution fail since elements of list str_extract_all , str_locate_all have length(0) , nrow(0). solved issue adding 2 if statements:

if(length(y) == 0){   y = na } if(nrow(z) == 0){   z = rbind(z, list(start = na, end = na)) }

this makes dates = "na , adds row of na's start , end rows no dates. allows id have 1 row bind in data.frame step.

Search This Blog

ANy

r - Extract string and its location using dplyr/tidyr approach -

Comments

Post a Comment

Popular posts from this blog

ZeroMQ on Windows, with Qt Creator -

ios - MKAnnotationView layer is not of expected type: MKLayer -

python - Error while using APScheduler: 'NoneType' object has no attribute 'now' -