Consider options for error messages at `bake()` time about `skip = TRUE`

Open juliasilge opened this issue 3 years ago • 1 comments

We have a good bit of documentation about skipping vs. not skipping:

  • https://www.tmwr.org/recipes.html#skip-equals-true
  • https://recipes.tidymodels.org/articles/Skipping.html
  • the individual function pages, etc

However, people continue to have a hard time with this (see #961, #943, #714, and more) and it seems to be one of the more common problems that people trip up on. The error messages that users see are often quite hard to interpret:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

recipe(mpg ~ cyl, data = mtcars) %>%
  step_log(mpg) %>%
  prep() %>%
  bake(new_data = tibble(cyl = 4))
#> Error:
#> ! Assigned data `log(new_data[[col_names[i]]] + object$offset, base = object$base)` must be compatible with existing data.
#> ✖ Existing data has 1 row.
#> ✖ Assigned data has 0 rows.
#> ℹ Row updates require a list value. Do you need `list()` or `as.list()`?

Created on 2022-04-25 by the reprex package (v2.0.1)

We've discussed this before and there is no easy or obviously good way to check the columns before actually baking. What can we do to get to better error messages for this? 🤔

🎯 One idea for discussion: a new function along the lines of `check_nominal_type()` or `check_training_set()` that we could use to check that the needed columns are in `new_data` while baking a step.
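
As a rough illustration of the first idea, here is a minimal base-R sketch of such a check. The helper name `check_needed_columns()`, its arguments, and the wording of the message are all assumptions made for this example; none of them are part of recipes.

```r
# Hypothetical helper, not part of recipes: fail early with a readable
# message when `new_data` lacks columns that a step needs at bake() time.
check_needed_columns <- function(needed, new_data, step_name) {
  missing_cols <- setdiff(needed, names(new_data))
  if (length(missing_cols) > 0) {
    stop(
      "`", step_name, "()` needs the following column(s) in `new_data`: ",
      paste0("`", missing_cols, "`", collapse = ", "),
      ". If they are not available at bake() time (e.g. the outcome), ",
      "consider `skip = TRUE` for this step.",
      call. = FALSE
    )
  }
  invisible(new_data)
}

# Mirrors the reprex above: the step needs `mpg`, but only `cyl` is supplied.
new_data <- data.frame(cyl = 4)
msg <- tryCatch(
  check_needed_columns("mpg", new_data, step_name = "step_log"),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```

A check like this only helps if each step can say which columns it needs, which is the harder part of the problem.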

🎯 Another idea: Should there be a `bake_step` for all steps to go through before the `bake.step_log`, `bake.step_dummy`, etc. methods, to handle step-level, widely applicable ops like this? Is there anything else we need like this, while we are at it? Maybe related to #797?

juliasilge, Apr 25 '22 22:04

I personally like the second option. Ideally, we would add the logic here, around/after line 600.

https://github.com/tidymodels/recipes/blob/d26981702fbdf3826d3b150fc87884c50fab1de3/R/recipe.R#L595-L607

The main roadblock right now is that there isn't a consistent way to determine which columns a given step uses when baking. For example, `step_normalize()` uses the names of the `means` object to filter columns, while `step_sqrt()` uses the `columns` object.

library(recipes)

rec_spec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors()) %>%
  step_sqrt(mpg) %>%
  prep()

rec_spec$steps[[1]]$means |> names()
#>  [1] "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

rec_spec$steps[[2]]$columns
#>   mpg 
#> "mpg"
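
One possible way around this inconsistency, sketched under assumptions: an S3 generic that lets each step class report its own required columns, wherever it stores them. The generic `required_columns()` and its methods are hypothetical and not part of recipes, and the step objects below are plain mock lists that only mimic the `means` and `columns` fields shown above.

```r
# Hypothetical generic, not part of recipes: each step class reports the
# columns it needs at bake() time, wherever it happens to store them.
required_columns <- function(object, ...) UseMethod("required_columns")

# Fallback for steps that have no method: report no known requirement.
required_columns.default <- function(object, ...) character(0)

# step_normalize() keeps the column names as the names of `means`.
required_columns.step_normalize <- function(object, ...) names(object$means)

# step_sqrt() keeps them in `columns`.
required_columns.step_sqrt <- function(object, ...) unname(object$columns)

# Mock prepped steps: plain lists standing in for real step objects.
steps <- list(
  structure(
    list(means = colMeans(mtcars[c("cyl", "disp")])),
    class = c("step_normalize", "step")
  ),
  structure(
    list(columns = c(mpg = "mpg")),
    class = c("step_sqrt", "step")
  )
)

# Columns each step needs, and which of them are absent from `new_data`.
needed <- lapply(steps, required_columns)
new_data <- data.frame(cyl = 4)
missing_by_step <- lapply(needed, setdiff, names(new_data))
str(missing_by_step)
```

With something like this in place, the shared code path that runs before each `bake()` method could compare the reported columns against `new_data` and raise one consistent error.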

EmilHvitfeldt, Apr 25 '22 23:04