recipes `check_training_set()` should probably loosely check `training` column types

If you specify a recipe() with a particular data frame, then prep() with a different data frame, check_training_set() does a few checks to make sure the two are compatible. In particular, right now it checks that the column names of the template data frame used in recipe() also exist in the data you prep() with. However, it doesn't check the column types at all, so you can end up with odd errors like this one:

library(recipes)

df <- tibble(x = 1, y = "1")

# Original template `df` has a character `y`
rec <- recipe(~ ., df) %>%
  step_bin2factor(where(is.numeric))

# But when we prep we have a numeric `y`
df <- tibble(x = 1, y = 1)

# Awkward error
prep(rec, training = df, fresh = TRUE)
#> Error: The variables should be numeric

# The tidyselect bits selected `x` and `y` because they are both numeric,
# but in the original template data, `y` was nominal, which is why we get this error
rec$var_info
#> # A tibble: 2 × 4
#>   variable type    role      source  
#>   <chr>    <chr>   <chr>     <chr>   
#> 1 x        numeric predictor original
#> 2 y        nominal predictor original

I wonder if check_training_set() should also loosely check the types. For example, it could check that each column supplied through prep() is numeric or nominal if the corresponding column in the original template supplied in recipe() had that type.
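A loose check along these lines might look roughly like the following. This is only a base R sketch to illustrate the idea, not how recipes implements anything: `loose_type()` and `check_loose_types()` are hypothetical names, and the "numeric"/"nominal"/"other" buckets are an assumption about what "loosely" could mean.

```r
# Hypothetical helper, not part of recipes: bucket a column into a loose type.
loose_type <- function(col) {
  if (is.numeric(col)) {
    "numeric"
  } else if (is.character(col) || is.factor(col)) {
    "nominal"
  } else {
    "other"
  }
}

# Hypothetical check: compare the loose types of the columns shared by the
# template and the new training data, and error if any of them changed.
check_loose_types <- function(template, training) {
  shared <- intersect(names(template), names(training))
  old <- vapply(template[shared], loose_type, character(1))
  new <- vapply(training[shared], loose_type, character(1))
  changed <- shared[old != new]
  if (length(changed) > 0) {
    stop(
      "Column types differ from the data passed to `recipe()`: ",
      paste0(
        "`", changed, "` (", old[changed], " -> ", new[changed], ")",
        collapse = ", "
      ),
      call. = FALSE
    )
  }
  invisible(training)
}

template <- data.frame(x = 1, y = "1")
training <- data.frame(x = 1, y = 1)

# Capture the error message instead of stopping the script.
result <- tryCatch(
  check_loose_types(template, training),
  error = function(e) conditionMessage(e)
)
result
#> [1] "Column types differ from the data passed to `recipe()`: `y` (nominal -> numeric)"
```

Comparing loose buckets rather than exact classes would still let an integer column stand in for a double one, while catching the character-to-numeric switch from the example above.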

Sep 14 '21 16:09 DavisVaughan

This isn't a huge deal in practice, because workflows/hardhat typically handle type consistency ahead of time.

Sep 14 '21 16:09 DavisVaughan

This is a slightly bigger deal, because {workflows}/{hardhat} doesn't deal with this.

Basically, we are in trouble if the data passed to the data argument of recipe() doesn't match the data set used in prep()/fit().

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

# Original template `training` has a character `y`
training <- tibble(outcome = rnorm(1000), x = rnorm(1000), y = sample(letters, 1000, TRUE))

rec1 <- recipe(outcome ~ ., training) %>%
  step_bin2factor(where(is.numeric), -all_outcomes())

rec2 <- recipe(outcome ~ ., training) %>%
  step_bin2factor(all_numeric_predictors())

# But when we prep we have a numeric `y`
testing <- tibble(outcome = rnorm(1000), x = rnorm(1000), y = rnorm(1000))

# no longer error
prep(rec1, training = testing, fresh = TRUE)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#> 
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variable to factor conversion for: x and y | Trained
prep(rec2, training = testing, fresh = TRUE)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#> 
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variable to factor conversion for: x | Trained

prep(rec1, training = testing, fresh = FALSE)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#> 
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variable to factor conversion for: x and y | Trained
prep(rec2, training = testing, fresh = FALSE)
#> 
#> ── Recipe ──────────────────────────────────────────────────────────────────────
#> 
#> ── Inputs
#> Number of variables by role
#> outcome:   1
#> predictor: 2
#> 
#> ── Training information
#> Training data contained 1000 data points and no incomplete rows.
#> 
#> ── Operations
#> • Dummy variable to factor conversion for: x | Trained

rec1$var_info$type
#> [[1]]
#> [1] "double"  "numeric"
#> 
#> [[2]]
#> [1] "string"    "unordered" "nominal"  
#> 
#> [[3]]
#> [1] "double"  "numeric"
rec2$var_info$type
#> [[1]]
#> [1] "double"  "numeric"
#> 
#> [[2]]
#> [1] "string"    "unordered" "nominal"  
#> 
#> [[3]]
#> [1] "double"  "numeric"

library(tidymodels)

wf1 <- workflow(rec1, linear_reg())
wf2 <- workflow(rec2, linear_reg())

fit(wf1, testing)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels

fit(wf2, testing)
#> Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels

Created on 2024-05-30 with reprex v2.1.0

May 30 '24 22:05 EmilHvitfeldt

What we need is a ptype for the input.

A partially prepped recipe does not guarantee that information about the input data is preserved. Therefore there is no way to check whether the data is identical, because we don't store the ptype of the original data set, only its column names 😞

library(recipes)

rec <- recipe(~ ., data = mtcars) |>
  step_pca(all_predictors()) |>
  prep()

rec$template
#> # A tibble: 32 × 5
#>      PC1   PC2    PC3     PC4    PC5
#>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>
#>  1 -195. -12.8  11.4  -0.0164  2.17 
#>  2 -195. -12.9  11.7   0.479   2.11 
#>  3 -142. -25.9  16.0   1.34   -1.18 
#>  4 -279.  38.3  14.0  -0.157  -0.817
#>  5 -399.  37.3   1.38 -2.56   -0.444
#>  6 -248.  25.6  12.2   3.01   -1.08 
#>  7 -435. -20.9 -13.9  -0.801  -0.916
#>  8 -160.  20.0  23.3   1.06    0.787
#>  9 -172. -10.8  18.3   4.40   -0.836
#> 10 -209. -19.7   8.94  2.58    1.33 
#> # ℹ 22 more rows
rec$var_info
#> # A tibble: 11 × 4
#>    variable type      role      source  
#>    <chr>    <list>    <chr>     <chr>   
#>  1 mpg      <chr [2]> predictor original
#>  2 cyl      <chr [2]> predictor original
#>  3 disp     <chr [2]> predictor original
#>  4 hp       <chr [2]> predictor original
#>  5 drat     <chr [2]> predictor original
#>  6 wt       <chr [2]> predictor original
#>  7 qsec     <chr [2]> predictor original
#>  8 vs       <chr [2]> predictor original
#>  9 am       <chr [2]> predictor original
#> 10 gear     <chr [2]> predictor original
#> 11 carb     <chr [2]> predictor original

rec |>
  step_normalize(all_predictors()) |>
  prep(mtcars |> mutate(vs = as.logical(vs)), fresh = TRUE)
#> Error in `step_pca()`:
#> Caused by error in `prep()`:
#> ✖ All columns selected for the step should be double or integer.
#> • 1 logical variable found: `vs`

Created on 2024-05-30 with reprex v2.1.0

May 31 '24 05:05 EmilHvitfeldt

We need ptype information from https://github.com/tidymodels/recipes/pull/1329 to be able to handle this issue properly
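To illustrate what ptype information buys us, here is a minimal base R sketch. It is not the approach taken in the linked pull request, and `input_ptype()` and `ptype_mismatches()` are hypothetical names. It assumes a zero-row slice of the input is an acceptable stand-in for a ptype.

```r
# Hypothetical helpers, not part of recipes.

# Keep a zero-row copy of the input data as a lightweight prototype:
# it has the column names and classes, but none of the rows.
input_ptype <- function(data) {
  data[0, , drop = FALSE]
}

# Return the names of shared columns whose class differs from the prototype.
ptype_mismatches <- function(ptype, new_data) {
  shared <- intersect(names(ptype), names(new_data))
  same <- vapply(
    shared,
    function(nm) identical(class(ptype[[nm]]), class(new_data[[nm]])),
    logical(1)
  )
  shared[!same]
}

ptype <- input_ptype(mtcars)

# Same change as in the example above: `vs` becomes logical.
new_data <- transform(mtcars, vs = as.logical(vs))

ptype_mismatches(ptype, new_data)
#> [1] "vs"
```

With something like this stored at recipe() time, the mismatch on `vs` could be reported up front rather than surfacing as an error inside step_pca().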

Jun 01 '24 01:06 EmilHvitfeldt

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

Jun 22 '24 00:06 github-actions[bot]
