step_lag doesn't retain history when applied to new data

Open ClaytonJY opened this issue 6 years ago • 1 comment

Perhaps I'm doing this incorrectly, but when using step_lag on a sequential/time-aware split (e.g. one from rsample::rolling_origin()), I expected applying the lag to the test set to bring in values from the training set. That doesn't happen:

library(tidyverse)
library(rsample)
#> Loading required package: broom
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stringr':
#> 
#>     fixed
#> The following object is masked from 'package:stats':
#> 
#>     step

# meaningless data
tbl <- tibble(
  id = LETTERS[1:10],
  val = seq_len(10) * 10,
  out = val*val
)

# only need one to demo
split <- rolling_origin(tbl)$splits[[1]]

# see it
analysis(split)
#> # A tibble: 5 x 3
#>   id      val   out
#>   <chr> <dbl> <dbl>
#> 1 A        10   100
#> 2 B        20   400
#> 3 C        30   900
#> 4 D        40  1600
#> 5 E        50  2500
assessment(split)
#> # A tibble: 1 x 3
#>   id      val   out
#>   <chr> <dbl> <dbl>
#> 1 F        60  3600

# lag predictor by 1
rec <- tbl %>%
  recipe(out~val) %>%
  step_lag(all_predictors(), lag = 1) %>%
  prep(analysis(split), retain = TRUE)

# analysis
juice(rec)
#> # A tibble: 5 x 3
#>     val   out lag_1_val
#>   <dbl> <dbl>     <dbl>
#> 1    10   100        NA
#> 2    20   400        10
#> 3    30   900        20
#> 4    40  1600        30
#> 5    50  2500        40

# assessment
bake(rec, assessment(split))
#> # A tibble: 1 x 3
#>     val   out lag_1_val
#>   <dbl> <dbl>     <dbl>
#> 1    60  3600        NA

Created on 2018-09-07 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.4 (2018-03-15)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Detroit             
#>  date     2018-09-07
#> Packages -----------------------------------------------------------------
#>  package    * version    date       source                             
#>  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.4)                     
#>  backports    1.1.2      2017-12-13 CRAN (R 3.4.4)                     
#>  base       * 3.4.4      2018-03-16 local                              
#>  bindr        0.1.1      2018-03-13 CRAN (R 3.4.4)                     
#>  bindrcpp     0.2.2      2018-03-29 CRAN (R 3.4.4)                     
#>  broom      * 0.5.0      2018-07-17 CRAN (R 3.4.4)                     
#>  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.4)                     
#>  class        7.3-14     2015-08-30 CRAN (R 3.4.0)                     
#>  cli          1.0.0      2017-11-05 CRAN (R 3.4.4)                     
#>  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.4)                     
#>  compiler     3.4.4      2018-03-16 local                              
#>  crayon       1.3.4      2017-09-16 CRAN (R 3.4.4)                     
#>  datasets   * 3.4.4      2018-03-16 local                              
#>  devtools     1.13.6     2018-06-27 CRAN (R 3.4.4)                     
#>  digest       0.6.16     2018-08-22 CRAN (R 3.4.4)                     
#>  dplyr      * 0.7.6      2018-06-29 cran (@0.7.6)                      
#>  evaluate     0.11       2018-07-17 CRAN (R 3.4.4)                     
#>  fansi        0.3.0      2018-08-13 CRAN (R 3.4.4)                     
#>  forcats    * 0.3.0      2018-02-19 CRAN (R 3.4.4)                     
#>  ggplot2    * 3.0.0      2018-07-03 cran (@3.0.0)                      
#>  glue         1.3.0      2018-07-17 CRAN (R 3.4.4)                     
#>  gower        0.1.2      2017-02-23 CRAN (R 3.4.4)                     
#>  graphics   * 3.4.4      2018-03-16 local                              
#>  grDevices  * 3.4.4      2018-03-16 local                              
#>  grid         3.4.4      2018-03-16 local                              
#>  gtable       0.2.0      2016-02-26 CRAN (R 3.4.4)                     
#>  haven        1.1.2      2018-06-27 CRAN (R 3.4.4)                     
#>  hms          0.4.2.9000 2018-07-03 Github (tidyverse/hms@2e0a39a)     
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.4)                     
#>  httr         1.3.1      2017-08-20 CRAN (R 3.4.4)                     
#>  ipred        0.9-7      2018-08-14 CRAN (R 3.4.4)                     
#>  jsonlite     1.5        2017-06-01 CRAN (R 3.4.4)                     
#>  knitr        1.20       2018-02-20 CRAN (R 3.4.4)                     
#>  lattice      0.20-35    2017-03-25 CRAN (R 3.3.3)                     
#>  lava         1.6.3      2018-08-10 CRAN (R 3.4.4)                     
#>  lazyeval     0.2.1      2017-10-29 CRAN (R 3.4.4)                     
#>  lubridate    1.7.4      2018-04-11 CRAN (R 3.4.4)                     
#>  magrittr     1.5        2014-11-22 CRAN (R 3.4.4)                     
#>  MASS         7.3-50     2018-04-30 CRAN (R 3.4.4)                     
#>  Matrix       1.2-14     2018-04-09 CRAN (R 3.4.4)                     
#>  memoise      1.1.0      2017-04-21 CRAN (R 3.4.4)                     
#>  methods    * 3.4.4      2018-03-16 local                              
#>  modelr       0.1.2      2018-05-11 CRAN (R 3.4.4)                     
#>  munsell      0.5.0      2018-06-12 CRAN (R 3.4.4)                     
#>  nlme         3.1-137    2018-04-07 CRAN (R 3.4.4)                     
#>  nnet         7.3-12     2016-02-02 CRAN (R 3.4.0)                     
#>  pillar       1.3.0      2018-07-14 CRAN (R 3.4.4)                     
#>  pkgconfig    2.0.2      2018-08-16 CRAN (R 3.4.4)                     
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.4.4)                     
#>  prodlim      2018.04.18 2018-04-18 CRAN (R 3.4.4)                     
#>  purrr      * 0.2.5      2018-05-29 CRAN (R 3.4.4)                     
#>  R6           2.2.2      2017-06-17 CRAN (R 3.4.4)                     
#>  Rcpp         0.12.18    2018-07-23 CRAN (R 3.4.4)                     
#>  readr      * 1.2.0      2018-07-06 Github (tidyverse/readr@4b2e93a)   
#>  readxl       1.1.0      2018-04-20 CRAN (R 3.4.4)                     
#>  recipes    * 0.1.3.9000 2018-09-05 Github (topepo/recipes@cf2e5e6)    
#>  rlang        0.2.2      2018-08-16 cran (@0.2.2)                      
#>  rmarkdown    1.10       2018-06-11 CRAN (R 3.4.4)                     
#>  rpart        4.1-13     2018-02-23 CRAN (R 3.4.3)                     
#>  rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.4)                     
#>  rsample    * 0.0.2.9000 2018-09-07 Github (tidymodels/rsample@69e9782)
#>  rvest        0.3.2      2016-06-17 CRAN (R 3.4.4)                     
#>  scales       1.0.0      2018-08-09 CRAN (R 3.4.4)                     
#>  splines      3.4.4      2018-03-16 local                              
#>  stats      * 3.4.4      2018-03-16 local                              
#>  stringi      1.2.4      2018-07-20 CRAN (R 3.4.4)                     
#>  stringr    * 1.3.1      2018-05-10 CRAN (R 3.4.4)                     
#>  survival     2.42-6     2018-07-13 CRAN (R 3.4.4)                     
#>  tibble     * 1.4.2      2018-01-22 CRAN (R 3.4.4)                     
#>  tidyr      * 0.8.1      2018-05-18 CRAN (R 3.4.4)                     
#>  tidyselect   0.2.4      2018-02-26 CRAN (R 3.4.4)                     
#>  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.4.4)                     
#>  timeDate     3043.102   2018-02-21 CRAN (R 3.4.4)                     
#>  tools        3.4.4      2018-03-16 local                              
#>  utf8         1.1.4      2018-05-24 CRAN (R 3.4.4)                     
#>  utils      * 3.4.4      2018-03-16 local                              
#>  withr        2.1.2      2018-03-15 CRAN (R 3.4.4)                     
#>  xml2         1.2.0      2018-01-24 CRAN (R 3.4.4)                     
#>  yaml         2.2.0      2018-07-25 CRAN (R 3.4.4)

I expected lag_1_val in the assessment set to pull from the last row of the analysis set and thus be 50; instead it is NA.

This is perhaps a big change to step_lag, but it seems essential for using this step in almost any real-world problem.
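
In the meantime, one possible workaround is to pad the new data with the tail of the analysis set before baking and to drop the padding afterwards. A minimal sketch, rebuilding the data and recipe from the reprex above; bake_with_history() and its n_lag argument are hypothetical names, not part of recipes, and the caller has to keep n_lag at least as large as the largest lag used in the recipe:

```r
library(dplyr)
library(rsample)
library(recipes)

# same meaningless data as in the reprex
tbl <- tibble(
  id  = LETTERS[1:10],
  val = seq_len(10) * 10,
  out = val * val
)

split <- rolling_origin(tbl)$splits[[1]]

rec <- recipe(out ~ val, data = tbl) %>%
  step_lag(all_predictors(), lag = 1) %>%
  prep(analysis(split), retain = TRUE)

# Hypothetical helper (not a recipes function): prepend the last `n_lag`
# rows of the history to the new data, bake the padded data, then keep
# only the rows that correspond to the new data.
bake_with_history <- function(rec, history, new_data, n_lag = 1) {
  padded <- bind_rows(tail(history, n_lag), new_data)
  baked  <- bake(rec, padded)
  tail(baked, nrow(new_data))
}

result <- bake_with_history(rec, analysis(split), assessment(split), n_lag = 1)
result
```

With the data above, the single assessment row should come back with lag_1_val = 50 rather than NA. This only works when the rows directly preceding the new data are available to the caller, which is exactly what breaks down in the skip case described next.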

A related tricky case: what if I had set skip to 1 or more? How could step_lag handle that appropriately? If skip = 1, the first observation in the assessment set would need to draw from the skipped observation, which would not normally be accessible to the recipe.
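
To make the skip case concrete, here is a small sketch under the reading of skip used above, where an observation between the analysis and assessment sets is left out. The row indices are made up for illustration and are not produced by rolling_origin(). Lagging the full series before splitting gives the assessment row its value from the skipped observation, which a recipe prepped on the analysis set alone never sees:

```r
library(dplyr)

# same meaningless data as in the reprex
tbl <- tibble(
  id  = LETTERS[1:10],
  val = seq_len(10) * 10,
  out = val * val
)

# Lag on the full, ordered series before any splitting.
lagged <- tbl %>% mutate(lag_1_val = dplyr::lag(val, 1))

# Hypothetical split with a one-row gap: analysis = A-E, F skipped, assessment = G.
analysis_rows   <- 1:5
assessment_rows <- 7

# The history available from the analysis set ends at E,
# but the correct lag for G is the value of the skipped row F.
last_seen  <- tail(tbl$val[analysis_rows], 1)
needed_lag <- lagged$lag_1_val[assessment_rows]

c(last_seen = last_seen, needed_lag = needed_lag)
```

Here the last value the analysis set knows about is 50, while the lag actually needed for the assessment row is 60, so padding with the tail of the analysis set would silently give the wrong answer.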

This is also related to the (long) discussion in https://github.com/tidymodels/rsample/issues/42, and the notion that skips can be implied by the lagging of features. Time makes everything harder!

ClaytonJY, Sep 07 '18 18:09