recipes
`check_training_set()` should probably loosely check `training` column types
If you specify a recipe() with a particular data frame, then prep() with a different data frame, check_training_set() does a few checks to make sure the two are compatible. In particular, right now it checks that the column names of the template data frame used in recipe() also exist in the data you prep() with. However, it doesn't check the types at all, so you can end up with odd errors like this one:
library(recipes)
df <- tibble(x = 1, y = "1")
# Original template `df` has a character `y`
rec <- recipe(~ ., df) %>%
  step_bin2factor(where(is.numeric))
# But when we prep we have a numeric `y`
df <- tibble(x = 1, y = 1)
# Awkward error
prep(rec, training = df, fresh = TRUE)
#> Error: The variables should be numeric
# The tidyselect bits selected `x` and `y` because they are both numeric,
# but in the original template data, `y` was nominal, which is why we get this error
rec$var_info
#> # A tibble: 2 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 x numeric predictor original
#> 2 y nominal predictor original
I wonder if check_training_set() should also check the types a little somehow. It could check that the columns supplied through prep() are numeric or nominal if the original template supplied in recipe() had that type.
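A loose check along these lines could look roughly like the sketch below. It uses only base R; `loose_type()` and `find_type_mismatches()` are hypothetical helper names, not recipes functions, and the sketch compares the two data frames directly rather than going through `rec$var_info`.

```r
# Hypothetical helper, not part of recipes: classify a column loosely.
loose_type <- function(x) {
  if (is.numeric(x)) {
    "numeric"
  } else if (is.character(x) || is.factor(x)) {
    "nominal"
  } else {
    "other"
  }
}

# Hypothetical helper: compare the loose types of the template and the data
# given to prep(). Returns a data frame with one row per mismatched column.
find_type_mismatches <- function(template, training) {
  shared <- intersect(names(template), names(training))
  expected <- vapply(template[shared], loose_type, character(1))
  actual <- vapply(training[shared], loose_type, character(1))
  changed <- expected != actual
  data.frame(
    variable = shared[changed],
    expected = unname(expected[changed]),
    actual = unname(actual[changed]),
    stringsAsFactors = FALSE
  )
}

# Same situation as the reprex: `y` is character in the template,
# numeric in the data used for prepping.
template <- data.frame(x = 1, y = "1", stringsAsFactors = FALSE)
training <- data.frame(x = 1, y = 1)

mismatches <- find_type_mismatches(template, training)
mismatches
```

Here `mismatches` has a single row reporting that `y` was nominal in the template but numeric in the data passed to prep(), which is the information a friendlier error message would need.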
This isn't a huge deal in practice, because workflows/hardhat typically handle the type consistency ahead of time
This is a slightly bigger deal than that, because {workflows}/{hardhat} don't deal with this.
Basically, we are in trouble if the data that is passed to the `data` argument of recipe() doesn't match the data set that is used in prep() or fit():
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
training <- tibble(outcome = rnorm(1000), x = rnorm(1000), y = sample(letters, 1000, replace = TRUE))
rec1 <- recipe(outcome ~ ., training) %>%
  step_bin2factor(where(is.numeric), -all_outcomes())
# Original template `training` has a character `y`
rec2 <- recipe(outcome ~ ., training) %>%
  step_bin2factor(all_numeric_predictors())
# But when we prep we have a numeric `y`
testing <- tibble(outcome = rnorm(1000), x = rnorm(1000), y = rnorm(1000))
# No longer an error
prep(rec1, training = testing, fresh = TRUE)
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 2
#>
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#>
#> ── Operations
#> • Dummy variable to factor conversion for: x and y | Trained
prep(rec2, training = testing, fresh = TRUE)
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 2
#>
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#>
#> ── Operations
#> • Dummy variable to factor conversion for: x | Trained
prep(rec1, training = testing, fresh = FALSE)
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 2
#>
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#>
#> ── Operations
#> • Dummy variable to factor conversion for: x and y | Trained
prep(rec2, training = testing, fresh = FALSE)
#>
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#>
#> ── Inputs
#> Number of variables by role
#> outcome: 1
#> predictor: 2
#>
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#>
#> ── Operations
#> • Dummy variable to factor conversion for: x | Trained
rec1$var_info$type
#> [[1]]
#> [1] "double" "numeric"
#>
#> [[2]]
#> [1] "string" "unordered" "nominal"
#>
#> [[3]]
#> [1] "double" "numeric"
rec2$var_info$type
#> [[1]]
#> [1] "double" "numeric"
#>
#> [[2]]
#> [1] "string" "unordered" "nominal"
#>
#> [[3]]
#> [1] "double" "numeric"
library(tidymodels)
wf1 <- workflow(rec1, linear_reg())
wf2 <- workflow(rec2, linear_reg())
fit(wf1, testing)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
fit(wf2, testing)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels
Created on 2024-05-30 with reprex v2.1.0
What we need is a ptype for the input.
A partially prepped recipe does not guarantee any information about the input data. Therefore, there is no way to check whether the data is identical, as we don't store the ptype of the original data set beyond its names 😞
library(recipes)
rec <- recipe(~ ., data = mtcars) |>
  step_pca(all_predictors()) |>
  prep()
rec$template
#> # A tibble: 32 × 5
#> PC1 PC2 PC3 PC4 PC5
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -195. -12.8 11.4 -0.0164 2.17
#> 2 -195. -12.9 11.7 0.479 2.11
#> 3 -142. -25.9 16.0 1.34 -1.18
#> 4 -279. 38.3 14.0 -0.157 -0.817
#> 5 -399. 37.3 1.38 -2.56 -0.444
#> 6 -248. 25.6 12.2 3.01 -1.08
#> 7 -435. -20.9 -13.9 -0.801 -0.916
#> 8 -160. 20.0 23.3 1.06 0.787
#> 9 -172. -10.8 18.3 4.40 -0.836
#> 10 -209. -19.7 8.94 2.58 1.33
#> # ℹ 22 more rows
rec$var_info
#> # A tibble: 11 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 mpg <chr [2]> predictor original
#> 2 cyl <chr [2]> predictor original
#> 3 disp <chr [2]> predictor original
#> 4 hp <chr [2]> predictor original
#> 5 drat <chr [2]> predictor original
#> 6 wt <chr [2]> predictor original
#> 7 qsec <chr [2]> predictor original
#> 8 vs <chr [2]> predictor original
#> 9 am <chr [2]> predictor original
#> 10 gear <chr [2]> predictor original
#> 11 carb <chr [2]> predictor original
rec |>
  step_normalize(all_predictors()) |>
  prep(mtcars |> mutate(vs = as.logical(vs)), fresh = TRUE)
#> Error in `step_pca()`:
#> Caused by error in `prep()`:
#> ✖ All columns selected for the step should be double or integer.
#> • 1 logical variable found: `vs`
Created on 2024-05-30 with reprex v2.1.0
We need ptype information from https://github.com/tidymodels/recipes/pull/1329 to be able to handle this issue properly
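As a rough illustration of what storing a ptype would enable, the sketch below keeps a zero-row copy of the input and compares column classes against it. It uses only base R and is not meant to reflect how the linked pull request implements this; `make_ptype()` and `changed_columns()` are hypothetical names.

```r
# Hypothetical helpers, not part of recipes.

# Keep a zero-row copy of the data: column names and classes, no values.
make_ptype <- function(data) {
  data[0, , drop = FALSE]
}

# Return the names of shared columns whose class differs from the prototype.
changed_columns <- function(ptype, new_data) {
  shared <- intersect(names(ptype), names(new_data))
  same <- vapply(
    shared,
    function(col) identical(class(ptype[[col]]), class(new_data[[col]])),
    logical(1)
  )
  shared[!same]
}

# Prototype taken from the data originally given to recipe()
ptype <- make_ptype(mtcars)

# Data later given to prep(), with `vs` turned into a logical column
new_data <- transform(mtcars, vs = as.logical(vs))

changed <- changed_columns(ptype, new_data)
changed
```

With `mtcars` as the original input, `changed` is `"vs"`. Comparing classes exactly is stricter than a loose numeric/nominal check: an integer column replaced by a double column would also be flagged.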