recipes
Deselecting a variable with `step_select(-my_variable)` causes prediction to fail
`step_select()` produces a surprising error. If you deselect a predictor with `step_select()`, fit a model with the recipe, and then try to predict, you get this error message:
Error in `step_select()`:
! The following required column is missing from `new_data` in step 'select_***': ***.
Run `rlang::last_error()` to see where the error occurred.
My guess at the cause: with the "-" notation, the step actually selects all the other variables, among them the outcome variable, which is not supposed to be part of the data at prediction time. Also see this related post.
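A minimal sketch of that guess, using plain dplyr selection rather than the recipe step itself (the variable names `kept` and `new_cars` are only illustrative). It shows that a negative selection resolves to every other column, outcome included, so the resolved set cannot be satisfied once the outcome is absent:

```r
library(dplyr)

# A negative selection keeps every column except the named one,
# so the outcome `mpg` ends up in the resolved selection.
kept <- names(select(mtcars, -carb))
"mpg" %in% kept
#> [1] TRUE

# At prediction time the outcome is not part of the new data,
# so one of the "selected" columns is missing.
new_cars <- select(mtcars, -mpg)
setdiff(kept, names(new_cars))
#> [1] "mpg"
```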
For me, an improved error message would have saved 30 minutes of trying to figure out the problem, e.g., adding "did you remember to set skip = TRUE?" to the error raised by step_select.
A more bullet-proof solution would be to always ignore the outcome variable when predicting, if the step_select was used to deselect stuff.
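For reference, the `skip = TRUE` workaround can be sketched at the recipe level as below. This assumes a recipes version in which `step_select()` is still available (1.0.1 in the session info further down); the names `rec_skip`, `new_cars`, and `baked` are illustrative only.

```r
library(recipes)

# With skip = TRUE the step runs when the recipe is prepped on the training
# data, but it is skipped when baking new data.
rec_skip <- recipe(mpg ~ ., data = mtcars) |>
  step_select(-carb, skip = TRUE) |>
  prep()

# New data without the outcome, as it would look at prediction time.
new_cars <- mtcars[, setdiff(names(mtcars), "mpg")]

# No "required column is missing" error, because the select step is skipped.
baked <- bake(rec_skip, new_data = new_cars)
names(baked)
```

Note the trade-off: because the step is skipped, it does not drop `carb` from new data either, so this only avoids the error and does not make the baked new data match the training columns.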
Here is a reprex:
library(tidyverse)
library(tidymodels)
set.seed(42)
mtcars_split <- initial_split(mtcars)
training_lm <- training(mtcars_split)
testing_lm <- testing(mtcars_split)
lm_recipe <- recipe(mpg ~ ., training_lm) %>%
  step_select(-carb) # <--- *** HERE IS THE PROBLEM *** Adding skip = TRUE works around the issue
lm_spec <- linear_reg() %>%
  set_engine("lm")
lm_model <- workflow() %>%
  add_recipe(lm_recipe) %>%
  add_model(lm_spec) %>%
  fit(training_lm) # fit on the training set
lm_model %>%
  augment(testing_lm)
The output:
Error in `step_select()`:
! The following required column is missing from `new_data` in step 'select_BNc5G': mpg.
Run `rlang::last_error()` to see where the error occurred.
and my session info:
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C LC_TIME=English_United States.utf8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] yardstick_1.0.0 workflowsets_1.0.0 workflows_1.0.0 tune_1.0.0 rsample_1.0.0 recipes_1.0.1 parsnip_1.0.0
[8] modeldata_1.0.0 infer_1.0.2 dials_1.0.0 scales_1.2.0 broom_1.0.0 tidymodels_1.0.0 forcats_0.5.1
[15] stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
[22] tidyverse_1.3.1
loaded via a namespace (and not attached):
[1] httr_1.4.2 foreach_1.5.2 jsonlite_1.8.0 splines_4.2.0 prodlim_2019.11.13 modelr_0.1.8 assertthat_0.2.1
[8] GPfit_1.0-8 cellranger_1.1.0 globals_0.14.0 ipred_0.9-13 pillar_1.7.0 backports_1.4.1 lattice_0.20-45
[15] glue_1.6.2 digest_0.6.29 rvest_1.0.2 hardhat_1.2.0 colorspace_2.0-3 Matrix_1.4-1 timeDate_4021.104
[22] pkgconfig_2.0.3 lhs_1.1.5 DiceDesign_1.9 listenv_0.8.0 haven_2.5.0 gower_1.0.0 lava_1.6.10
[29] tzdb_0.3.0 generics_0.1.2 ellipsis_0.3.2 furrr_0.3.0 withr_2.5.0 nnet_7.3-17 cli_3.3.0
[36] survival_3.3-1 magrittr_2.0.3 crayon_1.5.1 readxl_1.4.0 fs_1.5.2 fansi_1.0.3 future_1.25.0
[43] parallelly_1.31.1 MASS_7.3-56 xml2_1.3.3 class_7.3-20 tools_4.2.0 hms_1.1.1 lifecycle_1.0.1
[50] munsell_0.5.0 reprex_2.0.1 compiler_4.2.0 rlang_1.0.4 grid_4.2.0 iterators_1.0.14 rstudioapi_0.13
[57] gtable_0.3.0 codetools_0.2-18 DBI_1.1.3 R6_2.5.1 lubridate_1.8.0 future.apply_1.9.0 utf8_1.2.2
[64] stringi_1.7.6 parallel_4.2.0 Rcpp_1.0.8.3 vctrs_0.4.1 rpart_4.1.16 dbplyr_2.1.1 tidyselect_1.1.2
Generally we advise that you use step_rm() instead of step_select() to remove variables, for some of these reasons 🙂
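A minimal sketch of that suggestion, assuming the same `mtcars` setup used elsewhere in this thread (the names `rec_rm`, `new_cars`, and `baked` are illustrative). `step_rm()` only names the column to drop, so the outcome is not required when baking new data:

```r
library(recipes)

# Remove `vs` with step_rm() instead of deselecting it with step_select(-vs).
rec_rm <- recipe(mpg ~ ., data = mtcars) |>
  step_rm(vs) |>
  prep()

# New data without the outcome, as it would look at prediction time.
new_cars <- mtcars[, setdiff(names(mtcars), "mpg")]

# Baking works without `mpg`, and `vs` is dropped from the result.
baked <- bake(rec_rm, new_data = new_cars)
names(baked)
```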
Sounds reasonable and I will do it with step_rm moving forward.
Intuitively (as someone moving to tidymodels, fluent in dplyr), using step_select to deselect a variable was the most logical thing to try.
Is there a way to improve the error message, to make it clearer to new users?
> Is there a way to improve the error message, to make it clearer to new users?
We will see what we can do!
This is also related to https://github.com/tidymodels/recipes/issues/741
More minimal reprex:
library(recipes)
rec <- recipe(mpg ~ ., data = mtcars) |>
step_select(-vs) |>
prep()
rec |>
bake(new_data = mtcars)
#> # A tibble: 32 × 10
#> cyl disp hp drat wt qsec am gear carb mpg
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 6 160 110 3.9 2.62 16.5 1 4 4 21
#> 2 6 160 110 3.9 2.88 17.0 1 4 4 21
#> 3 4 108 93 3.85 2.32 18.6 1 4 1 22.8
#> 4 6 258 110 3.08 3.22 19.4 0 3 1 21.4
#> 5 8 360 175 3.15 3.44 17.0 0 3 2 18.7
#> 6 6 225 105 2.76 3.46 20.2 0 3 1 18.1
#> 7 8 360 245 3.21 3.57 15.8 0 3 4 14.3
#> 8 4 147. 62 3.69 3.19 20 0 4 2 24.4
#> 9 4 141. 95 3.92 3.15 22.9 0 4 2 22.8
#> 10 6 168. 123 3.92 3.44 18.3 0 4 4 19.2
#> # ℹ 22 more rows
rec |>
bake(new_data = mtcars |> select(-mpg))
#> Error in `step_select()`:
#> ! The following required column is missing from `new_data` in step
#> 'select_VFmgH': mpg.
#> Backtrace:
#> ▆
#> 1. ├─recipes::bake(rec, new_data = select(mtcars, -mpg))
#> 2. └─recipes:::bake.recipe(rec, new_data = select(mtcars, -mpg)) at recipes/R/recipe.R:528:2
#> 3. ├─recipes::bake(step, new_data = new_data) at recipes/R/recipe.R:651:4
#> 4. └─recipes:::bake.step_select(step, new_data = new_data) at recipes/R/recipe.R:528:2
#> 5. └─recipes::check_new_data(object$terms, object, new_data) at recipes/R/select.R:103:2
#> 6. └─cli::cli_abort(...) at recipes/R/misc.R:795:2
#> 7. └─rlang::abort(...)