
step_novel() doesn't work for a value not seen in the training data if that value is already a factor level

Open · bcadenato opened this issue 2 years ago · 1 comment

The problem

I think this might be a subtle one. Suppose that:

  • the training set contains a variable that is a factor
  • the factor has a given value among its levels
  • the training set doesn't contain any observation with that value

When trying to predict with lm on a data set that includes an observation with that value, predict() exits with an error. This actually happened to me with a data set in modeldata.
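For context, the same failure can be sketched with base R alone. The toy data below (`train`, `test`, and the levels `a`, `b`, `c`) are illustrative assumptions and not part of the reprex. As far as I understand, lm() drops unused factor levels when it builds its model frame, so a level that is declared but never observed counts as new at prediction time.

```r
# Toy training data: level "c" is declared but never observed
train <- data.frame(
  y = c(1, 2, 3, 4),
  f = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))
)

fit <- lm(y ~ f, data = train)

# Levels the fitted model stored for f: only the observed ones
seen_levels <- fit$xlevels$f

# Test data that uses the declared-but-unobserved level
test <- data.frame(f = factor("c", levels = c("a", "b", "c")))

# predict() fails; capture the error message instead of stopping
msg <- tryCatch(
  {
    predict(fit, newdata = test)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

print(seen_levels)
print(msg)
```

The stored levels omit `c`, which is why the model treats it as unseen even though the factor itself knows about it.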

I learnt about step_novel() and assumed it would be enough to handle this situation. However, step_novel() does nothing if the value that is absent from the training data set is already a known level of the factor (i.e. it's part of the set of levels).

However, if I remove the value from the set of levels, predict() throws a warning instead of an error, and step_novel() works. The full reprex below reproduces this behaviour.

Considerations

I appreciate that there are more profound considerations at play here: I could stratify my data set when splitting it between training and testing, I could reset the levels of the factor to accommodate those in the training data set, etc.
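As a rough sketch of the "reset the levels" option, base R's droplevels() removes levels that have no observations. The toy data frame below is an illustrative assumption, not the Sacramento data.

```r
# Toy training data: level "c" is declared but never observed
train <- data.frame(
  y = c(1, 2, 3, 4),
  f = factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))
)

levels_before <- levels(train$f)

# droplevels() on a data frame drops unused levels from every factor column
train_dropped <- droplevels(train)
levels_after <- levels(train_dropped$f)

print(levels_before)
print(levels_after)
```

With the unused level gone from the training set, a value such as `c` in new data is genuinely novel, which is the case the reprex shows step_novel() handling.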

However, I also think there is a reasonable expectation about step_novel() behaviour that the function should meet: if a value is not present in the training data set, that value should be transformed into another value such as new.

Alternatively, perhaps the models supported by the tidymodels framework should handle this situation gracefully, without an error.

Reproducible example

library(tidyverse)
library(tidymodels)

data(Sacramento)

# Create a training set without ANTELOPE as city value 
# and a test set with ANTELOPE as a city value

sacr_tr <- Sacramento %>% 
    filter(! city %in% c("ANTELOPE"))

sacr_te <- Sacramento %>% 
    filter(city %in% c("ANTELOPE"))

# Create a workflow that uses step_novel in the recipe, and fit the model

rec <- recipe(
    price ~ city,
    data = sacr_tr) %>% 
    step_novel(city)

mod <- linear_reg() %>% 
    set_engine("lm") %>% 
    set_mode("regression")

wf <- workflow() %>% 
    add_recipe(rec) %>% 
    add_model(mod)

wf_fit <- wf %>% 
    fit(sacr_tr)

# The model cannot predict on the test set because it had not seen ANTELOPE before as a value, 
# even if ANTELOPE is a level it knows

wf_pred <- wf_fit %>% 
    predict(sacr_te)
#> Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor city has new level ANTELOPE

# Remove ANTELOPE level from city set of levels in the training set
# and refit the model with the resulting training set

sacr_tr_fct <- sacr_tr %>% 
    mutate(
        city = city %>% 
            as.character() %>% 
            factor())

rec_fct <- recipe(
    price ~ city,
    data = sacr_tr_fct) %>% 
    step_novel(city)

wf_fct <- wf %>% 
    update_recipe(
        rec_fct)

wf_fct_fit <- wf_fct %>% 
    fit(sacr_tr_fct)

# predict() now runs without an error (only a warning), even though
# it cannot make a real prediction for ANTELOPE.
# The ANTELOPE level is converted to the `new` level and the model can manage it

wf_fct_pred <- wf_fct_fit %>% 
    predict(sacr_te)
#> Warning: Novel levels found in column 'city': 'ANTELOPE'. The levels have been
#> removed, and values have been coerced to 'NA'.

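# Compare what each fitted recipe produces for the test set:
# - wf_fit (ANTELOPE is still a level in the training set): ANTELOPE is left unchanged
# - wf_fct_fit (ANTELOPE is not a level in the training set): step_novel
#   transforms it to the value `new` as expected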

wf_fit %>% 
    extract_recipe() %>% 
    bake(sacr_te)
#> # A tibble: 33 × 2
#>    city      price
#>    <fct>     <int>
#>  1 ANTELOPE 126640
#>  2 ANTELOPE 161250
#>  3 ANTELOPE 182716
#>  4 ANTELOPE 194818
#>  5 ANTELOPE 387731
#>  6 ANTELOPE 165000
#>  7 ANTELOPE 180000
#>  8 ANTELOPE 200000
#>  9 ANTELOPE 255000
#> 10 ANTELOPE 261000
#> # ℹ 23 more rows

wf_fct_fit %>% 
    extract_recipe() %>% 
    bake(sacr_te)
#> # A tibble: 33 × 2
#>    city   price
#>    <fct>  <int>
#>  1 new   126640
#>  2 new   161250
#>  3 new   182716
#>  4 new   194818
#>  5 new   387731
#>  6 new   165000
#>  7 new   180000
#>  8 new   200000
#>  9 new   255000
#> 10 new   261000
#> # ℹ 23 more rows

Created on 2023-10-31 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22)
#>  os       macOS Big Sur/Monterey 10.16
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Madrid
#>  date     2023-10-31
#>  pandoc   3.1.9 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version    date (UTC) lib source
#>  backports      1.4.1      2021-12-13 [1] CRAN (R 4.2.0)
#>  broom        * 1.0.4      2023-03-11 [1] CRAN (R 4.2.0)
#>  class          7.3-21     2023-01-23 [1] CRAN (R 4.2.0)
#>  cli            3.6.1      2023-03-23 [1] CRAN (R 4.2.0)
#>  codetools      0.2-19     2023-02-01 [1] CRAN (R 4.2.0)
#>  colorspace     2.1-0      2023-01-23 [1] CRAN (R 4.2.0)
#>  data.table     1.14.8     2023-02-17 [1] CRAN (R 4.2.0)
#>  dials        * 1.2.0      2023-04-03 [1] CRAN (R 4.2.0)
#>  DiceDesign     1.9        2021-02-13 [1] CRAN (R 4.2.0)
#>  digest         0.6.31     2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr        * 1.1.2      2023-04-20 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate       0.20       2023-01-17 [1] CRAN (R 4.2.0)
#>  fansi          1.0.4      2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.1      2023-02-24 [1] CRAN (R 4.2.0)
#>  forcats      * 1.0.0      2023-01-29 [1] CRAN (R 4.2.0)
#>  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.2.0)
#>  fs             1.6.2      2023-04-25 [1] CRAN (R 4.2.0)
#>  furrr          0.3.1      2022-08-15 [1] CRAN (R 4.2.0)
#>  future         1.32.0     2023-03-07 [1] CRAN (R 4.2.0)
#>  future.apply   1.10.0     2022-11-05 [1] CRAN (R 4.2.0)
#>  generics       0.1.3      2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2      * 3.4.2      2023-04-03 [1] CRAN (R 4.2.0)
#>  globals        0.16.2     2022-11-21 [1] CRAN (R 4.2.0)
#>  glue           1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  gower          1.0.1      2022-12-22 [1] CRAN (R 4.2.0)
#>  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.2.0)
#>  gtable         0.3.3      2023-03-21 [1] CRAN (R 4.2.0)
#>  hardhat        1.3.0      2023-03-30 [1] CRAN (R 4.2.0)
#>  hms            1.1.3      2023-03-21 [1] CRAN (R 4.2.0)
#>  htmltools      0.5.5      2023-03-23 [1] CRAN (R 4.2.0)
#>  infer        * 1.0.4      2022-12-02 [1] CRAN (R 4.2.0)
#>  ipred          0.9-14     2023-03-09 [1] CRAN (R 4.2.0)
#>  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.2.0)
#>  knitr          1.42       2023-01-25 [1] CRAN (R 4.2.0)
#>  lattice        0.21-8     2023-04-05 [1] CRAN (R 4.2.0)
#>  lava           1.7.2.1    2023-02-27 [1] CRAN (R 4.2.0)
#>  lhs            1.1.6      2022-12-17 [1] CRAN (R 4.2.0)
#>  lifecycle      1.0.3      2022-10-07 [1] CRAN (R 4.2.0)
#>  listenv        0.9.0      2022-12-16 [1] CRAN (R 4.2.0)
#>  lubridate    * 1.9.2      2023-02-10 [1] CRAN (R 4.2.0)
#>  magrittr       2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  MASS           7.3-59     2023-04-21 [1] CRAN (R 4.2.0)
#>  Matrix         1.5-4      2023-04-04 [1] CRAN (R 4.2.0)
#>  modeldata    * 1.2.0      2023-08-09 [1] CRAN (R 4.2.0)
#>  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.2.0)
#>  nnet           7.3-18     2022-09-28 [1] CRAN (R 4.2.0)
#>  parallelly     1.35.0     2023-03-23 [1] CRAN (R 4.2.0)
#>  parsnip      * 1.1.0      2023-04-12 [1] CRAN (R 4.2.0)
#>  pillar         1.9.0      2023-03-22 [1] CRAN (R 4.2.0)
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  prodlim        2023.03.31 2023-04-02 [1] CRAN (R 4.2.0)
#>  purrr        * 1.0.1      2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache        0.16.0     2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3    1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.2     2022-11-11 [1] CRAN (R 4.2.0)
#>  R6             2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  Rcpp           1.0.10     2023-01-22 [1] CRAN (R 4.2.0)
#>  readr        * 2.1.4      2023-02-10 [1] CRAN (R 4.2.0)
#>  recipes      * 1.0.6      2023-04-25 [1] CRAN (R 4.2.0)
#>  reprex         2.0.2      2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang          1.1.1      2023-04-28 [1] CRAN (R 4.2.0)
#>  rmarkdown      2.21       2023-03-26 [1] CRAN (R 4.2.0)
#>  rpart          4.1.19     2022-10-21 [1] CRAN (R 4.2.0)
#>  rsample      * 1.2.0      2023-08-23 [1] CRAN (R 4.2.0)
#>  rstudioapi     0.14       2022-08-22 [1] CRAN (R 4.2.0)
#>  scales       * 1.2.1      2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi        1.7.12     2023-01-11 [1] CRAN (R 4.2.0)
#>  stringr      * 1.5.0      2022-12-02 [1] CRAN (R 4.2.0)
#>  styler         1.10.2     2023-08-29 [1] CRAN (R 4.2.0)
#>  survival       3.5-5      2023-03-12 [1] CRAN (R 4.2.0)
#>  tibble       * 3.2.1      2023-03-20 [1] CRAN (R 4.2.0)
#>  tidymodels   * 1.0.0      2022-07-13 [1] CRAN (R 4.2.0)
#>  tidyr        * 1.3.0      2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect     1.2.0      2022-10-10 [1] CRAN (R 4.2.0)
#>  tidyverse    * 2.0.0      2023-02-22 [1] CRAN (R 4.2.0)
#>  timechange     0.2.0      2023-01-11 [1] CRAN (R 4.2.0)
#>  timeDate       4022.108   2023-01-07 [1] CRAN (R 4.2.0)
#>  tune         * 1.1.1      2023-04-11 [1] CRAN (R 4.2.0)
#>  tzdb           0.3.0      2022-03-28 [1] CRAN (R 4.2.0)
#>  utf8           1.2.3      2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs          0.6.3      2023-06-14 [1] CRAN (R 4.2.0)
#>  withr          2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  workflows    * 1.1.3      2023-02-22 [1] CRAN (R 4.2.0)
#>  workflowsets * 1.0.1      2023-04-06 [1] CRAN (R 4.2.0)
#>  xfun           0.39       2023-04-20 [1] CRAN (R 4.2.0)
#>  yaml           2.3.7      2023-01-23 [1] CRAN (R 4.2.0)
#>  yardstick    * 1.2.0      2023-04-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.2/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

bcadenato · Oct 31 '23 14:10

Thanks for reporting! That does appear to be a bug, or at least the wrong way to handle this situation. We will look into it.

EmilHvitfeldt · Oct 31 '23 17:10