R - dplyr group_by and iterative loop calculation
I am trying to perform an iterative calculation on grouped data that depends on two previous elements within a group. Toy example:
set.seed(100)
df = data.table(id = rep("a_index1", 9),
                year = c(2001:2005, 2001:2004),
                price = c(NA, NA, 10, NA, NA, 15, NA, 13, NA),
                index = sample(seq(1, 3, by = 0.5), size = 9, replace = TRUE))

R> df
         id year price index
1: a_index1 2001    NA   1.5
2: a_index1 2002    NA   1.5
3: a_index1 2003    10   2.0
4: a_index1 2004    NA   1.0
5: a_index1 2005    NA   2.0
6: a_index1 2006    15   2.0
7: a_index1 2007    NA   3.0
8: a_index1 2008    13   1.5
9: a_index1 2009    NA   2.0
The objective is to fill in the missing prices by taking the last available price and adjusting it by the index. I have a loop that performs these calculations, which I am trying to vectorize using dplyr.
My logic is defined in the loop below:
df$price_adj = df$price
for (i in 2:nrow(df)) {
  if (is.na(df$price[i])) {
    df$price_adj[i] = round(df$price_adj[i-1] * df$index[i] / df$index[i-1], 2)
  }
}

R> df
         id year price index price_adj
1: a_index1 2001    NA   1.5        NA
2: a_index1 2002    NA   1.5        NA
3: a_index1 2003    10   2.0     10.00
4: a_index1 2004    NA   1.0      5.00
5: a_index1 2005    NA   2.0     10.00
6: a_index1 2006    15   2.0     15.00
7: a_index1 2007    NA   3.0     22.50
8: a_index1 2008    13   1.5     13.00
9: a_index1 2009    NA   2.0     17.33
In my actual large data, I have to apply this function to multiple groups, and speed is a consideration. My attempt at this is below, but it needs a point in the right direction. I did consider Reduce, but am not sure how it can incorporate two previous elements within a group.
foo = function(price, index){
  for (i in 2:nrow(df)) {
    if (is.na(df$price[i])) {
      df$price_adj[i] = df$price_adj[i-1] * df$index[i] / df$index[i-1]
    }
  }
}

df %>%
  group_by(id) %>%
  mutate(price_adj = price, price_adj = foo(price, index))
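For reference, Reduce() can carry the previous element when accumulate = TRUE is used: the accumulator holds price_adj[i-1], and index[i-1] is reachable by iterating over positions rather than values. A minimal ungrouped sketch, with the price and index vectors hard-coded from the toy data above:

```r
# price and index hard-coded from the toy data
price <- c(NA, NA, 10, NA, NA, 15, NA, 13, NA)
index <- c(1.5, 1.5, 2.0, 1.0, 2.0, 2.0, 3.0, 1.5, 2.0)

# each step receives the previous adjusted price as `prev`, so the
# two-element dependence is expressed via the accumulator plus index[i-1]
step <- function(prev, i) {
  if (is.na(price[i])) prev * index[i] / index[i-1] else price[i]
}

price_adj <- round(Reduce(step, 2:length(price), init = price[1], accumulate = TRUE), 2)
price_adj
# [1]    NA    NA 10.00  5.00 10.00 15.00 22.50 13.00 17.33
```

Note the leading NAs simply propagate, matching the loop's behaviour.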
One option is cumprod:
df %>%
  # group the data frame into chunks starting with a non-NA price
  group_by(id, g = cumsum(!is.na(price))) %>%
  # for each chunk, multiply the first non-NA price by the cumprod of index[i]/index[i-1]
  mutate(price_adj = round(first(price) * cumprod(index / lag(index, default = first(index))), 2)) %>%
  ungroup() %>%
  select(-g)

# A tibble: 9 x 5
#         id  year price index price_adj
#     <fctr> <int> <dbl> <dbl>     <dbl>
# 1 a_index1  2001    NA   1.5        NA
# 2 a_index1  2002    NA   1.5        NA
# 3 a_index1  2003    10   2.0     10.00
# 4 a_index1  2004    NA   1.0      5.00
# 5 a_index1  2005    NA   2.0     10.00
# 6 a_index1  2001    15   2.0     15.00
# 7 a_index1  2002    NA   3.0     22.50
# 8 a_index1  2003    13   1.5     13.00
# 9 a_index1  2004    NA   2.0     17.33
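As a quick sanity check on that expression, the loop's recurrence telescopes within a chunk. A minimal sketch on the chunk beginning at the observed price 10 (indices 2.0, 1.0, 2.0 from the toy data), using base R's head() in place of dplyr's lag():

```r
index <- c(2.0, 1.0, 2.0)  # the chunk starting at the observed price 10
p0 <- 10                   # first(price) for this chunk

# loop form of the recurrence: price_adj[i] = price_adj[i-1] * index[i] / index[i-1]
adj <- numeric(3)
adj[1] <- p0
for (i in 2:3) adj[i] <- adj[i-1] * index[i] / index[i-1]

# vectorized form; c(index[1], head(index, -1)) mimics lag(index, default = first(index))
vec <- p0 * cumprod(index / c(index[1], head(index, -1)))

adj
# [1] 10  5 10
```

Both forms give the same values, matching rows 3-5 of the output above.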
Group the data frame by id and cumsum(!is.na(price)); the latter splits the data frame into chunks, with each chunk starting at a non-NA price. first(price) * cumprod(index / lag(index, default = first(index))) performs the iterative calculation, and is equivalent to the formula given in the question if you keep substituting price_adj[i-1] with price_adj[i-2] and so on, until you reach price_adj[1], i.e. first(price).

Caveat: this may not be efficient if you have many NA chunks.
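To see how the grouping key behaves, here is the g column computed on the toy price vector alone:

```r
price <- c(NA, NA, 10, NA, NA, 15, NA, 13, NA)

# !is.na(price) is TRUE at each observed price; cumsum() bumps the
# label there, so every chunk starts at a non-NA price
# (leading NAs fall into group 0)
g <- cumsum(!is.na(price))
g
# [1] 0 0 1 1 1 2 2 3 3
```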
If speed is a primary concern, you can write a function using the Rcpp package:
library(Rcpp)
cppFunction("
NumericVector price_adj(NumericVector price, NumericVector index) {
  int n = price.size();
  NumericVector adjusted_price(n);
  adjusted_price[0] = price[0];
  for (int i = 1; i < n; i++) {
    if (NumericVector::is_na(price[i])) {
      adjusted_price[i] = adjusted_price[i-1] * index[i] / index[i-1];
    } else {
      adjusted_price[i] = price[i];
    }
  }
  return adjusted_price;
}")
Now use the C++ function with dplyr as follows:
cpp_fun <- function() df %>% group_by(id) %>% mutate(price_adj = round(price_adj(price, index), 2))

cpp_fun()
# A tibble: 9 x 5
# Groups: id [1]
#         id  year price index price_adj
#     <fctr> <int> <dbl> <dbl>     <dbl>
# 1 a_index1  2001    NA   1.5        NA
# 2 a_index1  2002    NA   1.5        NA
# 3 a_index1  2003    10   2.0     10.00
# 4 a_index1  2004    NA   1.0      5.00
# 5 a_index1  2005    NA   2.0     10.00
# 6 a_index1  2001    15   2.0     15.00
# 7 a_index1  2002    NA   3.0     22.50
# 8 a_index1  2003    13   1.5     13.00
# 9 a_index1  2004    NA   2.0     17.33
Benchmark:
Define r_fun as:
r_fun <- function() df %>%
  group_by(id, g = cumsum(!is.na(price))) %>%
  mutate(price_adj = round(first(price) * cumprod(index / lag(index, default = first(index))), 2)) %>%
  ungroup() %>%
  select(-g)
Even on the small sample data, there's a difference:
microbenchmark::microbenchmark(r_fun(), cpp_fun())
# Unit: milliseconds
#      expr       min        lq      mean    median        uq        max neval
#   r_fun() 10.127839 10.500281 12.627831 11.148093 12.686662 101.466975   100
# cpp_fun()  3.191278  3.308758  3.738809  3.491495  3.937006   6.627019   100
Testing on a larger data frame:
df <- bind_rows(rep(list(df), 10000))
# dim(df)
# [1] 90000     4

microbenchmark::microbenchmark(r_fun(), cpp_fun(), times = 10)
# Unit: milliseconds
#      expr        min         lq      mean    median        uq       max neval
#   r_fun() 842.706134 890.978575 904.70863 908.77042 921.89828 986.44576    10
# cpp_fun()   8.722794   8.888667  10.67781  10.86399  12.10647  13.68302    10
Identity test:
identical(ungroup(r_fun()), ungroup(cpp_fun()))
# [1] TRUE