Consider how `step_select()` handles outcomes and predictors
The problem
If a model is fit with workflows::workflow() and the recipe includes step_select(), then predict() on the fitted workflow fails.
However, the same recipe works fine without the workflows package.
Reproducible example
``` r
library(magrittr)

preprocessor <- recipes::recipe(mtcars, mpg ~ .) %>%
  recipes::step_select(mpg, wt)

model <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm")

# with workflows
workflows::workflow() %>%
  workflows::add_recipe(preprocessor) %>%
  workflows::add_model(model) %>%
  generics::fit(mtcars) %>%
  predict(mtcars)
#> Error: Can't subset columns that don't exist.
#> x Column `mpg` doesn't exist.

# without workflows
input <- preprocessor %>%
  recipes::prep() %>%
  recipes::juice()

model %>%
  generics::fit(mpg ~ ., data = input) %>%
  predict(input)
#> # A tibble: 32 x 1
#>    .pred
#>    <dbl>
#>  1  23.3
#>  2  21.9
#>  3  24.9
#>  4  20.1
#>  5  18.9
#>  6  18.8
#>  7  18.2
#>  8  20.2
#>  9  20.5
#> 10  18.9
#> # … with 22 more rows
```
Created on 2021-07-01 by the reprex package (v2.0.0)
<details>
<summary>Session Info</summary>

``` r
sessioninfo::session_info(c("workflows", "recipes", "parsnip"))
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.1.0 (2021-05-18)
#> os Ubuntu 20.04.2 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Etc/UTC
#> date 2021-07-01
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> class 7.3-19 2021-05-03 [2] CRAN (R 4.1.0)
#> cli 2.5.0 2021-04-26 [1] RSPM (R 4.1.0)
#> codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.0)
#> cpp11 0.3.1 2021-06-25 [1] RSPM (R 4.1.0)
#> crayon 1.4.1 2021-02-08 [1] RSPM (R 4.1.0)
#> dplyr 1.0.7 2021-06-18 [1] RSPM (R 4.1.0)
#> ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.1.0)
#> fansi 0.5.0 2021-05-25 [1] RSPM (R 4.1.0)
#> generics 0.1.0 2020-10-31 [1] RSPM (R 4.1.0)
#> globals 0.14.0 2020-11-22 [1] RSPM (R 4.1.0)
#> glue 1.4.2 2020-08-27 [1] RSPM (R 4.1.0)
#> gower 0.2.2 2020-06-23 [1] RSPM (R 4.1.0)
#> hardhat 0.1.5 2020-11-09 [1] RSPM (R 4.1.0)
#> ipred 0.9-11 2021-03-12 [1] RSPM (R 4.1.0)
#> KernSmooth 2.23-20 2021-05-03 [2] CRAN (R 4.1.0)
#> lattice 0.20-44 2021-05-02 [2] CRAN (R 4.1.0)
#> lava 1.6.9 2021-03-11 [1] RSPM (R 4.1.0)
#> lifecycle 1.0.0 2021-02-15 [1] RSPM (R 4.1.0)
#> lubridate 1.7.10 2021-02-26 [1] RSPM (R 4.1.0)
#> magrittr 2.0.1 2020-11-17 [1] RSPM (R 4.1.0)
#> MASS 7.3-54 2021-05-03 [2] CRAN (R 4.1.0)
#> Matrix 1.3-3 2021-05-04 [2] CRAN (R 4.1.0)
#> nnet 7.3-16 2021-05-03 [2] CRAN (R 4.1.0)
#> numDeriv 2016.8-1.1 2019-06-06 [1] RSPM (R 4.1.0)
#> parsnip 0.1.6.9000 2021-07-01 [1] Github (tidymodels/parsnip@89f8f93)
#> pillar 1.6.1 2021-05-16 [1] RSPM (R 4.1.0)
#> pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.1.0)
#> prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.1.0)
#> prodlim 2019.11.13 2019-11-17 [1] RSPM (R 4.1.0)
#> purrr 0.3.4 2020-04-17 [1] RSPM (R 4.1.0)
#> R6 2.5.0 2020-10-28 [1] RSPM (R 4.1.0)
#> Rcpp 1.0.6 2021-01-15 [1] RSPM (R 4.1.0)
#> recipes 0.1.16.9000 2021-07-01 [1] Github (tidymodels/recipes@39bc4e8)
#> rlang 0.4.11 2021-04-30 [1] RSPM (R 4.1.0)
#> rpart 4.1-15 2019-04-12 [2] CRAN (R 4.1.0)
#> SQUAREM 2021.1 2021-01-13 [1] RSPM (R 4.1.0)
#> survival 3.2-11 2021-04-26 [2] CRAN (R 4.1.0)
#> tibble 3.1.2 2021-05-16 [1] RSPM (R 4.1.0)
#> tidyr 1.1.3 2021-03-03 [1] RSPM (R 4.1.0)
#> tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.1.0)
#> timeDate 3043.102 2018-02-21 [1] RSPM (R 4.1.0)
#> utf8 1.2.1 2021-03-12 [1] RSPM (R 4.1.0)
#> vctrs 0.3.8 2021-04-29 [1] RSPM (R 4.1.0)
#> withr 2.4.2 2021-04-18 [1] RSPM (R 4.1.0)
#> workflows 0.2.2.9000 2021-07-01 [1] Github (tidymodels/workflows@8ad5a9d)
#>
#> [1] /usr/local/lib/R/site-library
#> [2] /usr/local/lib/R/library
```
</details>
The problem is that you have used step_select() on your outcome with the default skip = FALSE. (You can read more about skipping steps for new data here, but you don't want to skip the selection of the predictor, so I don't think that will help.)
The workflows package is very careful about separating predictors and outcomes to avoid data leakage; at prediction time, the outcome is not available, as a protection to all of us as users. This recipe you made says: "try to select the outcome" but the outcome is not available at prediction time. This is by design and is a feature of recipes + workflows.
If you can describe the real-world use case where you have run into this with a bit more detail, we can offer some advice for a solution, beyond, say, recipe(mpg ~ wt, data = mtcars).
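For the reprex above, a minimal sketch of that formula-based alternative could look like this. It assumes `wt` is the only predictor you want and drops step_select() entirely; the names `wf_fit` and `preds` are made up for illustration.

``` r
library(magrittr)

# Choose the predictor in the formula instead of selecting columns in a step,
# so no step refers to the outcome at predict() time.
preprocessor <- recipes::recipe(mpg ~ wt, data = mtcars)

model <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm")

wf_fit <- workflows::workflow() %>%
  workflows::add_recipe(preprocessor) %>%
  workflows::add_model(model) %>%
  generics::fit(mtcars)

# New data only needs the predictor column here; `mpg` is left out on purpose.
preds <- predict(wf_fit, mtcars["wt"])
```

Since nothing in the recipe touches `mpg` after the formula, the outcome does not need to be present in the data passed to predict().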
There does seem to be a little tension between this behavior and step_select(). Since step_select() requires you to specify the outcome to be able to keep it, to select just the numeric predictors you have to do step_select(outcome, all_numeric_predictors()). You also can't apply them in separate steps like:
``` r
rec %>%
  step_select(outcome, skip = TRUE) %>%
  step_select(all_numeric_predictors())
```
This doesn't work because the first selection will only keep the outcome, so the second one won't work correctly.
So I'm not sure you can use step_select() in combination with workflows/hardhat right now as is?
I wonder if we should have called it step_select_predictors(), or just forced step_select() to only be applied on the predictors? i.e. you'd always get the outcome columns, you never have to select them and can't select them away. That way at bake() time it never tries to select an outcome column, so we don't have issues here in workflows' predict() method.
That's a great point; I'm going to move this to recipes and we can consider if we want to make any changes to step_select(), or better document that it's not a great choice for use in a modeling analysis (i.e. use update_role() without the formula interface or something like that, depending on what you are trying to do).
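As a rough sketch of the update_role() idea, assuming the goal is again to model `mpg` from `wt` only: start from a recipe without a formula and assign roles explicitly. The names `rec` and `baked` are illustrative, and the sketch stops at prep()/bake() rather than covering a full workflow.

``` r
library(recipes)

# No formula, so no roles are assigned up front
# (assumption: only `wt` should be used to predict `mpg`).
rec <- recipe(mtcars) %>%
  update_role(mpg, new_role = "outcome") %>%
  update_role(wt, new_role = "predictor")

# Columns without a role are not matched by all_outcomes() or all_predictors(),
# so only `mpg` and `wt` are returned here.
baked <- rec %>%
  prep() %>%
  bake(new_data = NULL, all_outcomes(), all_predictors())
```

The selection is expressed through roles instead of a step, so there is no step that has to find the outcome column when new data is baked.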
Hi,
I just ran into this issue as well and it took me some time to figure out what's going on. It would be great to document this or update the behaviour or even create a new step (e.g. step_select_predictors as suggested above).
In the meantime, here is a workaround using step_rm() that I have used, in case somebody runs into the same issue... :)
step_rm(recipe, all_predictors(), -all_of(<variables that you want to keep>))
Thanks for the great tidymodels framework, great stuff!
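Filled in for the original reprex, the step_rm() workaround might look something like the sketch below. The `keep` vector (keeping `wt` and `hp`) is a hypothetical choice for illustration, as are the object names.

``` r
library(recipes)

# Hypothetical set of predictors to keep.
keep <- c("wt", "hp")

# Remove every predictor except those in `keep`; the outcome is never named,
# so the step does not need `mpg` when it is applied to new data.
preprocessor <- recipe(mpg ~ ., data = mtcars) %>%
  step_rm(all_predictors(), -all_of(keep))

model <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm")

wf_fit <- workflows::workflow() %>%
  workflows::add_recipe(preprocessor) %>%
  workflows::add_model(model) %>%
  generics::fit(mtcars)

preds <- predict(wf_fit, mtcars)
```

Because the step only ever selects predictors, it sidesteps the outcome-selection problem described earlier in the thread.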