r - Extract string and its location using dplyr/tidyr approach -
the input data frame has 3 id columns , 1 raw_text. u_id corresponds user, doc_id corresponds document of particular user , sentence id corresponds sentence within document of user.
df <- data.frame(u_id=c(1,1,1,1,1,2,2,2), doc_id=c(1,1,1,2,2,1,1,2), sent_id=c(1,2,3,1,2,1,2,1), text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:", "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .", "method exists , former because calls corresponding", "admission date: 2001-4-19 discharge date: 2002-5-23 service:", "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .", "method exists , former because calls corresponding", "method exists , former because calls corresponding", "method exists , former because calls corresponding"))
let's assume need extract dates , location raw_text. approach far -
#define regex date date<-"([0-9]{2,4})[- . /]([0-9]{1,4})[- . /]([0-9]{2,4})" #library library(dplyr) library(stringr) #extract dates df_i<-df %>% mutate(i=str_extract_all(text,date)) %>% mutate(date=lapply(i, function(x) if(identical(x, character(0))) na_character_ else x)) %>% unnest(date) #extract date locations df_ii<-str_locate_all(df$text,date) n<-max(sapply(df_ii, nrow)) date_loc<-as.data.frame(do.call(rbind, lapply(df_ii, function (x) rbind(x, matrix(, n-nrow(x), ncol(x))))))
the date extractions in data.frame format. there approach put string_locations in data.frame format corresponding id , string? ideally, output should -
output<-data.frame(id=c(1,1,2,2,3), text=c("admission date: 2001-4-19 discharge date: 2002-5-23 service:", "admission date: 2001-4-19 discharge date: 2002-5-23 service:", "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .", "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 .", "pertinent results: 2105-4-16 05:02pm gap-14 2105-4-16 04:23pm rdw-13.1 2105-4-16 ."), date=c("2001-4-19","2002-5-23","2105-4-16","2105-4-16","13.1 2105"), date_start=c(17,43,20,74,96), date_end=c(25,51,28,82,104))
you can this:
regex = "\\b[0-9]+[-][0-9]+[-][0-9]+\\b" df_i = str_extract_all(df$text, regex) df_ii = str_locate_all(df$text, regex) output1 = map(function(x, y, z){ if(length(y) == 0){ y = na } if(nrow(z) == 0){ z = rbind(z, list(start = na, end = na)) } data.frame(id = x, date = y, z) }, df$id, df_i, df_ii) %>% do.call(rbind,.) %>% merge(df, .)
or stick piping-only syntax:
regex = "[0-9]+[-][0-9]+[-][0-9]+" output1 = df %>% {list(.$id, str_extract_all(.$text, regex), str_locate_all(.$text, regex))} %>% {map(function(x, y, z){ if(length(y) == 0){ y = na } if(nrow(z) == 0){ z = rbind(z, list(start = na, end = na)) } data.frame(id = x, date = y, z) }, .[[1]], .[[2]], .[[3]])} %>% do.call(rbind, .) %>% merge(df, .)
result:
id 1 1 2 1 3 2 4 2 5 2 6 3 text 1 admission date: 2001-4-19 discharge date: 2002-5-23 service: 2 admission date: 2001-4-19 discharge date: 2002-5-23 service: 3 pertinent results: 2105-4-16 05:02pm gap-14 \n 2105-4-16 04:23pm rdw-13.1 2105-4-16 . 4 pertinent results: 2105-4-16 05:02pm gap-14 \n 2105-4-16 04:23pm rdw-13.1 2105-4-16 . 5 pertinent results: 2105-4-16 05:02pm gap-14 \n 2105-4-16 04:23pm rdw-13.1 2105-4-16 . 6 method exists , former because calls corresponding date start end 1 2001-4-19 17 25 2 2002-5-23 43 51 3 2105-4-16 20 28 4 2105-4-16 77 85 5 2105-4-16 104 112 6 <na> na na
notes:
- your regular expression incorrectly extracts "13.1" "rdw-13.1 2105-4-16" because added spaces in
[- . /]
.date<-"([0-9]{2,4})[-./]([0-9]{1,4})[-./]([0-9]{2,4})"
should it. mutate
allows use variable have created inside same function call, there no need use 2 separatemutate
'sdf_i
.- for pipping-only solution,
{}
needed aroundlist()
,map()
override dplyr default of feeding in output step first argument of next function.
for instance:
df %>% list(.$id, str_extract_all(.$text, regex), str_locate_all(.$text, regex))
becomes:
list(df, df$id, str_extract_all(df$text, regex), str_locate_all(df$text, regex))
which not want.
edits:
op updated df
include rows text
not include dates
. cause original solution fail since elements of list str_extract_all
, str_locate_all
have length(0)
, nrow(0)
. solved issue adding 2 if
statements:
if(length(y) == 0){ y = na } if(nrow(z) == 0){ z = rbind(z, list(start = na, end = na)) }
this makes dates = "na
, adds row of na's start
, end
rows no dates. allows id
have 1 row bind in data.frame
step.
Comments
Post a Comment