
Consider how `step_select()` handles outcomes and predictors

Open atusy opened this issue 4 years ago • 4 comments

The problem

If a model is fit with workflows::workflow() and the recipe includes step_select(), then predict.workflow() fails. However, the same recipe works fine without the workflows package.

Reproducible example

library(magrittr)

preprocessor <- recipes::recipe(mtcars, mpg ~ .) %>%
  recipes::step_select(mpg, wt)
  
model <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm")

# with workflows
workflows::workflow() %>%
  workflows::add_recipe(preprocessor) %>%
  workflows::add_model(model) %>%
  generics::fit(mtcars) %>%
  predict(mtcars)
#> Error: Can't subset columns that don't exist.
#> x Column `mpg` doesn't exist.

# without workflows
input <- preprocessor %>%
  recipes::prep() %>%
  recipes::juice()
model %>%
  generics::fit(mpg ~ ., data = input) %>%
  predict(input)
#> # A tibble: 32 x 1
#>    .pred
#>    <dbl>
#>  1  23.3
#>  2  21.9
#>  3  24.9
#>  4  20.1
#>  5  18.9
#>  6  18.8
#>  7  18.2
#>  8  20.2
#>  9  20.5
#> 10  18.9
#> # … with 22 more rows

Created on 2021-07-01 by the reprex package (v2.0.0)

<details><summary>Session Info</summary>

``` r
sessioninfo::session_info(c("workflows", "recipes", "parsnip"))
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.1.0 (2021-05-18)
#>  os       Ubuntu 20.04.2 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       Etc/UTC                     
#>  date     2021-07-01                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version     date       lib source                               
#>  class         7.3-19      2021-05-03 [2] CRAN (R 4.1.0)                       
#>  cli           2.5.0       2021-04-26 [1] RSPM (R 4.1.0)                       
#>  codetools     0.2-18      2020-11-04 [2] CRAN (R 4.1.0)                       
#>  cpp11         0.3.1       2021-06-25 [1] RSPM (R 4.1.0)                       
#>  crayon        1.4.1       2021-02-08 [1] RSPM (R 4.1.0)                       
#>  dplyr         1.0.7       2021-06-18 [1] RSPM (R 4.1.0)                       
#>  ellipsis      0.3.2       2021-04-29 [1] RSPM (R 4.1.0)                       
#>  fansi         0.5.0       2021-05-25 [1] RSPM (R 4.1.0)                       
#>  generics      0.1.0       2020-10-31 [1] RSPM (R 4.1.0)                       
#>  globals       0.14.0      2020-11-22 [1] RSPM (R 4.1.0)                       
#>  glue          1.4.2       2020-08-27 [1] RSPM (R 4.1.0)                       
#>  gower         0.2.2       2020-06-23 [1] RSPM (R 4.1.0)                       
#>  hardhat       0.1.5       2020-11-09 [1] RSPM (R 4.1.0)                       
#>  ipred         0.9-11      2021-03-12 [1] RSPM (R 4.1.0)                       
#>  KernSmooth    2.23-20     2021-05-03 [2] CRAN (R 4.1.0)                       
#>  lattice       0.20-44     2021-05-02 [2] CRAN (R 4.1.0)                       
#>  lava          1.6.9       2021-03-11 [1] RSPM (R 4.1.0)                       
#>  lifecycle     1.0.0       2021-02-15 [1] RSPM (R 4.1.0)                       
#>  lubridate     1.7.10      2021-02-26 [1] RSPM (R 4.1.0)                       
#>  magrittr      2.0.1       2020-11-17 [1] RSPM (R 4.1.0)                       
#>  MASS          7.3-54      2021-05-03 [2] CRAN (R 4.1.0)                       
#>  Matrix        1.3-3       2021-05-04 [2] CRAN (R 4.1.0)                       
#>  nnet          7.3-16      2021-05-03 [2] CRAN (R 4.1.0)                       
#>  numDeriv      2016.8-1.1  2019-06-06 [1] RSPM (R 4.1.0)                       
#>  parsnip       0.1.6.9000  2021-07-01 [1] Github (tidymodels/parsnip@89f8f93)  
#>  pillar        1.6.1       2021-05-16 [1] RSPM (R 4.1.0)                       
#>  pkgconfig     2.0.3       2019-09-22 [1] RSPM (R 4.1.0)                       
#>  prettyunits   1.1.1       2020-01-24 [1] RSPM (R 4.1.0)                       
#>  prodlim       2019.11.13  2019-11-17 [1] RSPM (R 4.1.0)                       
#>  purrr         0.3.4       2020-04-17 [1] RSPM (R 4.1.0)                       
#>  R6            2.5.0       2020-10-28 [1] RSPM (R 4.1.0)                       
#>  Rcpp          1.0.6       2021-01-15 [1] RSPM (R 4.1.0)                       
#>  recipes       0.1.16.9000 2021-07-01 [1] Github (tidymodels/recipes@39bc4e8)  
#>  rlang         0.4.11      2021-04-30 [1] RSPM (R 4.1.0)                       
#>  rpart         4.1-15      2019-04-12 [2] CRAN (R 4.1.0)                       
#>  SQUAREM       2021.1      2021-01-13 [1] RSPM (R 4.1.0)                       
#>  survival      3.2-11      2021-04-26 [2] CRAN (R 4.1.0)                       
#>  tibble        3.1.2       2021-05-16 [1] RSPM (R 4.1.0)                       
#>  tidyr         1.1.3       2021-03-03 [1] RSPM (R 4.1.0)                       
#>  tidyselect    1.1.1       2021-04-30 [1] RSPM (R 4.1.0)                       
#>  timeDate      3043.102    2018-02-21 [1] RSPM (R 4.1.0)                       
#>  utf8          1.2.1       2021-03-12 [1] RSPM (R 4.1.0)                       
#>  vctrs         0.3.8       2021-04-29 [1] RSPM (R 4.1.0)                       
#>  withr         2.4.2       2021-04-18 [1] RSPM (R 4.1.0)                       
#>  workflows     0.2.2.9000  2021-07-01 [1] Github (tidymodels/workflows@8ad5a9d)
#> 
#> [1] /usr/local/lib/R/site-library
#> [2] /usr/local/lib/R/library
```

</details>

atusy, Jul 01 '21 13:07

The problem is that you have used step_select() on your outcome with the default skip = FALSE. (You can read more about skipping steps for new data here, but you don't want to skip for the predictor here so I don't think that will help.)

The workflows package is very careful about separating predictors and outcomes to avoid data leakage; at prediction time, the outcome is not available, as a protection to all of us as users. This recipe you made says: "try to select the outcome" but the outcome is not available at prediction time. This is by design and is a feature of recipes + workflows.

If you can describe the real-world use case where you have run into this with a bit more detail, we can offer some advice for a solution, beyond, say, recipe(mpg ~ wt, data = mtcars).
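A minimal sketch of the formula-based alternative mentioned above, assuming the goal is simply to model mpg from wt. Choosing the columns in the recipe formula means no step ever refers to the outcome at prediction time. The object names (rec, wf_fit, new_cars, preds) are illustrative and not part of the original report.

``` r
library(recipes)
library(parsnip)
library(workflows)

# Pick the predictor in the formula instead of with step_select()
rec <- recipe(mpg ~ wt, data = mtcars)

wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = mtcars)

# New data only needs the predictor column; the outcome may be absent
new_cars <- mtcars[, "wt", drop = FALSE]
preds <- predict(wf_fit, new_data = new_cars)
```

Because the recipe has no step that selects mpg, predicting on a data frame without that column should not raise the "Column `mpg` doesn't exist" error from the report.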

juliasilge, Jul 01 '21 16:07

There does seem to be a little tension between this behavior and step_select(). Since step_select() requires you to specify the outcome to be able to keep it, to select just the numeric predictors you have to do step_select(outcome, all_numeric_predictors()). You also can't apply them in separate steps like:

rec %>%
  step_select(outcome, skip = TRUE) %>%
  step_select(all_numeric_predictors())

This doesn't work because the first selection will only keep the outcome, so the second one won't work correctly.

So I'm not sure you can use step_select() in combination with workflows/hardhat right now as is?

I wonder if we should have called it step_select_predictors(), or just forced step_select() to only be applied on the predictors? i.e. you'd always get the outcome columns, you never have to select them and can't select them away. That way at bake() time it never tries to select an outcome column, so we don't have issues here in workflows' predict() method.

DavisVaughan, Jul 01 '21 16:07

That's a great point; I'm going to move this to recipes and we can consider if we want to make any changes to step_select(), or better document that it's not a great choice for use in a modeling analysis (i.e. use update_role() without the formula interface or something like that, depending on what you are trying to do).
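One way to read the update_role() suggestion, as a rough sketch rather than an official recommendation: declare roles explicitly so that only the columns of interest become predictors, and no selection step has to mention the outcome. Only recipes is used here. How columns left without a role behave inside a workflow is not discussed in this thread, so that part is deliberately left out. The object names are illustrative.

``` r
library(recipes)

# No formula: every column starts without a role
rec <- recipe(mtcars) %>%
  update_role(mpg, new_role = "outcome") %>%
  update_role(wt, new_role = "predictor")

prepped <- prep(rec)

# Role selectors now pick out exactly the columns assigned above
predictors <- bake(prepped, new_data = NULL, all_predictors())
outcomes <- bake(prepped, new_data = NULL, all_outcomes())
```

Here all_predictors() resolves to wt and all_outcomes() to mpg, so downstream steps can rely on roles instead of a step_select() call that names the outcome.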

juliasilge, Jul 01 '21 19:07

Hi, I just ran into this issue as well and it took me some time to figure out what's going on. It would be great to document this, update the behaviour, or even create a new step (e.g. step_select_predictors as suggested above). In the meantime, here is a workaround based on step_rm that I have used, in case somebody runs into the same issue... :)

step_rm(recipe, all_predictors(), -all_of(<variables that you want to keep>))
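As a concrete sketch of this workaround applied to the reproducible example at the top of the thread, assuming wt is the only predictor to keep (the `keep` vector and the other object names are hypothetical):

``` r
library(recipes)
library(parsnip)
library(workflows)

keep <- c("wt")  # hypothetical: predictors to retain

rec <- recipe(mpg ~ ., data = mtcars) %>%
  # Remove every predictor except those in `keep`; the outcome is never touched
  step_rm(all_predictors(), -all_of(keep))

wf_fit <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = mtcars)

preds <- predict(wf_fit, new_data = mtcars)
```

Since step_rm() only ever refers to predictor columns here, baking at prediction time does not need the outcome, which is what tripped up step_select(mpg, wt).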

Thanks for the great tidymodels framework, great stuff!

slamao, Jan 04 '23 16:01