Deselecting a variable with `step_select(-my_variable)` causes prediction to fail

Open adisarid opened this issue 3 years ago • 4 comments

`step_select()` produces a surprising error. If you deselect a predictor with `step_select()`, fit a model with the recipe, and then try to predict, you get this error message:

Error in `step_select()`:
! The following required column is missing from `new_data` in step 'select_***': ***.
Run `rlang::last_error()` to see where the error occurred.

I'm guessing the reason is that the "-" notation makes the step select all other variables, including the outcome variable, which is not supposed to be part of the data at prediction time. Also see this related post.
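
To illustrate the guess above, here is a minimal sketch using plain dplyr rather than recipes. It assumes `step_select()` resolves a negative selection the same way `dplyr::select()` does, which the error message suggests but this sketch does not verify. The name `kept` is only for illustration.

```r
library(dplyr)

# Negating one column keeps every other column, and that includes
# the outcome (mpg), not just the remaining predictors.
kept <- names(select(mtcars, -vs))

kept
"mpg" %in% kept
```

If the step stores that full list of names at training time and requires all of them again at prediction time, then new data without `mpg` would fail exactly as reported.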

For me, an improved error message would have saved 30 minutes of trying to figure out the problem, e.g. adding "did you remember to set skip = TRUE?" to the step_select function.

A more bullet-proof solution would be to always ignore the outcome variable when predicting, if step_select was used to deselect variables.

Here is a reprex:

library(tidyverse)
library(tidymodels)

set.seed(42)
mtcars_split <- initial_split(mtcars)

training_lm <- training(mtcars_split)
testing_lm <- testing(mtcars_split)

lm_recipe <- recipe(mpg ~ ., training(mtcars_split)) %>% 
  step_select(-carb)  # <--- *** HERE IS THE PROBLEM *** When adding skip=TRUE it works around the issue

lm_spec <- linear_reg() %>% 
  set_engine("lm")

lm_model <- workflow() %>% 
  add_recipe(lm_recipe) %>% 
  add_model(lm_spec) %>% 
  fit(training_lm)  # fit on the training set; the error below does not depend on this

lm_model %>% 
  augment(testing_lm)

The output:

Error in `step_select()`:
! The following required column is missing from `new_data` in step 'select_BNc5G': mpg.
Run `rlang::last_error()` to see where the error occurred.

and my session info:

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] yardstick_1.0.0    workflowsets_1.0.0 workflows_1.0.0    tune_1.0.0         rsample_1.0.0      recipes_1.0.1      parsnip_1.0.0     
 [8] modeldata_1.0.0    infer_1.0.2        dials_1.0.0        scales_1.2.0       broom_1.0.0        tidymodels_1.0.0   forcats_0.5.1     
[15] stringr_1.4.0      dplyr_1.0.9        purrr_0.3.4        readr_2.1.2        tidyr_1.2.0        tibble_3.1.7       ggplot2_3.3.6     
[22] tidyverse_1.3.1   

loaded via a namespace (and not attached):
 [1] httr_1.4.2         foreach_1.5.2      jsonlite_1.8.0     splines_4.2.0      prodlim_2019.11.13 modelr_0.1.8       assertthat_0.2.1  
 [8] GPfit_1.0-8        cellranger_1.1.0   globals_0.14.0     ipred_0.9-13       pillar_1.7.0       backports_1.4.1    lattice_0.20-45   
[15] glue_1.6.2         digest_0.6.29      rvest_1.0.2        hardhat_1.2.0      colorspace_2.0-3   Matrix_1.4-1       timeDate_4021.104 
[22] pkgconfig_2.0.3    lhs_1.1.5          DiceDesign_1.9     listenv_0.8.0      haven_2.5.0        gower_1.0.0        lava_1.6.10       
[29] tzdb_0.3.0         generics_0.1.2     ellipsis_0.3.2     furrr_0.3.0        withr_2.5.0        nnet_7.3-17        cli_3.3.0         
[36] survival_3.3-1     magrittr_2.0.3     crayon_1.5.1       readxl_1.4.0       fs_1.5.2           fansi_1.0.3        future_1.25.0     
[43] parallelly_1.31.1  MASS_7.3-56        xml2_1.3.3         class_7.3-20       tools_4.2.0        hms_1.1.1          lifecycle_1.0.1   
[50] munsell_0.5.0      reprex_2.0.1       compiler_4.2.0     rlang_1.0.4        grid_4.2.0         iterators_1.0.14   rstudioapi_0.13   
[57] gtable_0.3.0       codetools_0.2-18   DBI_1.1.3          R6_2.5.1           lubridate_1.8.0    future.apply_1.9.0 utf8_1.2.2        
[64] stringi_1.7.6      parallel_4.2.0     Rcpp_1.0.8.3       vctrs_0.4.1        rpart_4.1.16       dbplyr_2.1.1       tidyselect_1.1.2  

adisarid, Jul 28 '22 11:07

Generally we advise that you use step_rm() instead of step_select() to remove variables for some of these reasons 🙂
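
A minimal sketch of what that could look like, using `mtcars` and dropping `vs` as an example. The names `rec_rm`, `new_cars`, and `baked` are made up for illustration. The expectation is that `step_rm()` only records the columns to drop, so the outcome does not need to be present in `new_data`.

```r
library(recipes)

# Remove a predictor by naming it, instead of selecting everything else.
rec_rm <- recipe(mpg ~ ., data = mtcars) |>
  step_rm(vs) |>
  prep()

# Simulate prediction-time data, where the outcome (mpg) is not available.
new_cars <- mtcars[, setdiff(names(mtcars), "mpg")]

baked <- bake(rec_rm, new_data = new_cars)
names(baked)
```

In a workflow, the same change would be swapping `step_select(-carb)` for `step_rm(carb)` in the recipe.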

EmilHvitfeldt, Jul 28 '22 12:07

Sounds reasonable and I will do it with step_rm moving forward.

Intuitively (as someone moving to tidymodels, fluent in dplyr), using step_select to deselect a variable was the most logical thing to try.

Is there a way to improve the error message, to make it clearer to new users?

adisarid, Jul 29 '22 19:07

> Is there a way to improve the error message, to make it clearer to new users?

We will see what we can do!

This is also related to https://github.com/tidymodels/recipes/issues/741

EmilHvitfeldt, Mar 30 '23 21:03

More minimal reprex:

library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) |>
  step_select(-vs) |>
  prep()

rec |>
  bake(new_data = mtcars)
#> # A tibble: 32 × 10
#>      cyl  disp    hp  drat    wt  qsec    am  gear  carb   mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     6  160    110  3.9   2.62  16.5     1     4     4  21  
#>  2     6  160    110  3.9   2.88  17.0     1     4     4  21  
#>  3     4  108     93  3.85  2.32  18.6     1     4     1  22.8
#>  4     6  258    110  3.08  3.22  19.4     0     3     1  21.4
#>  5     8  360    175  3.15  3.44  17.0     0     3     2  18.7
#>  6     6  225    105  2.76  3.46  20.2     0     3     1  18.1
#>  7     8  360    245  3.21  3.57  15.8     0     3     4  14.3
#>  8     4  147.    62  3.69  3.19  20       0     4     2  24.4
#>  9     4  141.    95  3.92  3.15  22.9     0     4     2  22.8
#> 10     6  168.   123  3.92  3.44  18.3     0     4     4  19.2
#> # ℹ 22 more rows

rec |>
  bake(new_data = mtcars |> select(-mpg))
#> Error in `step_select()`:
#> ! The following required column is missing from `new_data` in step
#>   'select_VFmgH': mpg.
#> Backtrace:
#>     ▆
#>  1. ├─recipes::bake(rec, new_data = select(mtcars, -mpg))
#>  2. └─recipes:::bake.recipe(rec, new_data = select(mtcars, -mpg)) at recipes/R/recipe.R:528:2
#>  3.   ├─recipes::bake(step, new_data = new_data) at recipes/R/recipe.R:651:4
#>  4.   └─recipes:::bake.step_select(step, new_data = new_data) at recipes/R/recipe.R:528:2
#>  5.     └─recipes::check_new_data(object$terms, object, new_data) at recipes/R/select.R:103:2
#>  6.       └─cli::cli_abort(...) at recipes/R/misc.R:795:2
#>  7.         └─rlang::abort(...)

EmilHvitfeldt, Mar 30 '23 21:03