r - dplyr group_by and iterative loop calculation


I am trying to perform an iterative calculation on grouped data that depends on two previous elements within a group. A toy example:

library(data.table)

set.seed(100)
df = data.table(id = c(rep("a_index1", 9)),
                year = c(2001:2005, 2001:2004),
                price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
                index = sample(seq(1, 3, by = 0.5), size = 9, replace = TRUE))

R> df
         id year price index
1: a_index1 2001    NA   1.5
2: a_index1 2002    NA   1.5
3: a_index1 2003    10   2.0
4: a_index1 2004    NA   1.0
5: a_index1 2005    NA   2.0
6: a_index1 2001    15   2.0
7: a_index1 2002    NA   3.0
8: a_index1 2003    13   1.5
9: a_index1 2004    NA   2.0

The objective is to fill in the missing prices using the last available price, adjusted by the index. I have a loop that performs these calculations, which I am trying to vectorize using dplyr.

My logic is defined in the loop below:

df$price_adj = df$price
for (i in 2:nrow(df)) {
  if (is.na(df$price[i])) {
    df$price_adj[i] = round(df$price_adj[i-1] * df$index[i] / df$index[i-1], 2)
  }
}

R> df
         id year price index price_adj
1: a_index1 2001    NA   1.5        NA
2: a_index1 2002    NA   1.5        NA
3: a_index1 2003    10   2.0     10.00
4: a_index1 2004    NA   1.0      5.00
5: a_index1 2005    NA   2.0     10.00
6: a_index1 2001    15   2.0     15.00
7: a_index1 2002    NA   3.0     22.50
8: a_index1 2003    13   1.5     13.00
9: a_index1 2004    NA   2.0     17.33

In my actual data, which is large, I have to apply this function to multiple groups, and speed is a consideration. My attempt at this is below, but it needs work; any pointers in the right direction would be appreciated. I did consider Reduce, but I am not sure how it can incorporate the two previous elements within a group.

library(dplyr)

foo = function(price, index){
  for (i in 2:nrow(df)) {
    if (is.na(df$price[i])) {
      df$price_adj[i] = df$price_adj[i-1] * df$index[i] / df$index[i-1]
    }
  }
}

df %>% 
  group_by(id) %>% 
  mutate(price_adj = price,
         price_adj = foo(price, index))
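As a side note on the Reduce idea: the recurrence only ever carries one value forward, so it can be expressed as a scan with purrr::accumulate2. This is only a sketch, assuming the purrr package is available; step is a helper name introduced here for illustration:

library(dplyr)
library(purrr)

# carry the previous adjusted price forward, rescaling by the
# index ratio whenever the current price is missing
step <- function(prev, p, r) if (is.na(p)) prev * r else p

df %>%
  group_by(id) %>%
  mutate(price_adj = round(unlist(accumulate2(
    price[-1],                   # prices after the group's first row
    (index / lag(index))[-1],    # ratios index[i] / index[i-1]
    step,
    .init = price[1]             # start the scan at the first price
  )), 2)) %>%
  ungroup()

This avoids the explicit loop, but it still calls an R function per element, so it helps readability more than speed.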

One option is cumprod:

df %>%      # group data frame chunks starting non na price     group_by(id, g = cumsum(!is.na(price))) %>%      # each chunk multiply first non na price cumprod of index[i]/index[i-1]     mutate(price_adj = round(first(price) * cumprod(index / lag(index, default=first(index))), 2)) %>%      ungroup() %>% select(-g)  # tibble: 9 x 5 #        id  year price index price_adj #    <fctr> <int> <dbl> <dbl>     <dbl> #1 a_index1  2001    na   1.5        na #2 a_index1  2002    na   1.5        na #3 a_index1  2003    10   2.0     10.00 #4 a_index1  2004    na   1.0      5.00 #5 a_index1  2005    na   2.0     10.00 #6 a_index1  2001    15   2.0     15.00 #7 a_index1  2002    na   3.0     22.50 #8 a_index1  2003    13   1.5     13.00 #9 a_index1  2004    na   2.0     17.33 
  • Group the data frame by id and cumsum(!is.na(price)); the latter splits the data frame into chunks, each of which starts with a non-NA price;

  • first(price) * cumprod(index / lag(index, default = first(index))) performs the iterative calculation; it is equivalent to the formula given in the question if you keep substituting price_adj[i-1] with price_adj[i-2] and so on, until you reach price_adj[1], i.e. first(price) (see the short check after this list);

Caveat: this may not be efficient if you have many NA chunks.
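To make the equivalence concrete: the ratios telescope within a chunk, so cumprod(index / lag(index, default = first(index))) collapses to index / first(index). A quick check on the chunk that starts at the price of 10 (index values 2.0, 1.0, 2.0):

index <- c(2, 1, 2)   # index values within one chunk
ratios <- index / dplyr::lag(index, default = index[1])
cumprod(ratios)
# [1] 1.0 0.5 1.0     # identical to index / index[1]
10 * cumprod(ratios)
# [1] 10  5 10        # matches price_adj for that chunk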


If speed is a primary concern, you can write a function using the Rcpp package:

library(Rcpp)
cppFunction("
    NumericVector price_adj(NumericVector price, NumericVector index) {
        int n = price.size();
        NumericVector adjusted_price(n);
        adjusted_price[0] = price[0];
        for (int i = 1; i < n; i++) {
            if (NumericVector::is_na(price[i])) {
                // missing price: carry the previous adjusted price forward,
                // rescaled by the ratio of consecutive index values
                adjusted_price[i] = adjusted_price[i-1] * index[i] / index[i-1];
            } else {
                adjusted_price[i] = price[i];
            }
        }
        return adjusted_price;
    }")

Now use the C++ function with dplyr as follows:

cpp_fun <- function() df %>% group_by(id) %>% mutate(price_adj = round(price_adj(price, index), 2))  cpp_fun() # tibble: 9 x 5 # groups:   id [1] #        id  year price index price_adj #    <fctr> <int> <dbl> <dbl>     <dbl> #1 a_index1  2001    na   1.5        na #2 a_index1  2002    na   1.5        na #3 a_index1  2003    10   2.0     10.00 #4 a_index1  2004    na   1.0      5.00 #5 a_index1  2005    na   2.0     10.00 #6 a_index1  2001    15   2.0     15.00 #7 a_index1  2002    na   3.0     22.50 #8 a_index1  2003    13   1.5     13.00 #9 a_index1  2004    na   2.0     17.33 

Benchmark:

Define r_fun as:

r_fun <- function() df %>% group_by(id, g = cumsum(!is.na(price))) %>% mutate(price_adj = round(first(price) * cumprod(index / lag(index, default=first(index))), 2)) %>% ungroup() %>% select(-g) 

On the small sample data, there is already a difference:

microbenchmark::microbenchmark(r_fun(), cpp_fun())
# Unit: milliseconds
#       expr       min        lq      mean    median        uq        max neval
#    r_fun() 10.127839 10.500281 12.627831 11.148093 12.686662 101.466975   100
#  cpp_fun()  3.191278  3.308758  3.738809  3.491495  3.937006   6.627019   100

Testing on a larger data frame:

df <- bind_rows(rep(list(df), 10000)) #dim(df) #[1] 90000     4  microbenchmark::microbenchmark(r_fun(), cpp_fun(), times = 10) #unit: milliseconds #      expr        min         lq      mean    median        uq       max neval #   r_fun() 842.706134 890.978575 904.70863 908.77042 921.89828 986.44576    10 # cpp_fun()   8.722794   8.888667  10.67781  10.86399  12.10647  13.68302    10 

Identity test:

identical(ungroup(r_fun()), ungroup(cpp_fun()))
# [1] TRUE
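Since df is a data.table in the question, the same chunking idea can also be written in data.table syntax. A sketch, not benchmarked above:

library(data.table)

dt <- as.data.table(df)
# chunk counter: increments at every non-NA price, as in the dplyr version
dt[, g := cumsum(!is.na(price)), by = id]
dt[, price_adj := round(first(price) * cumprod(index / shift(index, fill = first(index))), 2),
   by = .(id, g)]
dt[, g := NULL]

Like the dplyr version, this splits each id into NA chunks, so the same caveat about many chunks applies.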
