id fields converted to NA

Open topepo opened this issue 6 years ago • 2 comments

From this post.

library(dplyr, warn.conflicts = FALSE)
library(recipes, warn.conflicts = FALSE)

data <- tibble(
  id = letters[1:12],
  output = rnorm(12, mean = 0),
  pred1 = rnorm(12, mean = 10),
  pred2 = rnorm(12, mean = 20),
  pred3 = factor(rep(c('f1', 'f2', 'f3'), 4))
)

data$pred1[c(1,6)] <- NA
data$pred2[c(2,7)] <- NA
df_train <- data[1:5,]
df_test <- data[6:10,]

rec_obj <- recipe(x = df_train) %>%
  update_role(output, new_role = 'outcome') %>%
  update_role(id, new_role = "id variable") %>%
  update_role(-output, -id, new_role = 'predictor') %>%
  step_dummy(pred3) %>%
  step_center(pred1, pred2) %>%
  step_scale(pred1, pred2) %>%
  step_medianimpute(all_predictors())

rec_trained <- prep(rec_obj, training = df_train)
train_data    <- bake(rec_trained, new_data = df_train)
test_data     <- bake(rec_trained, new_data = df_test)
test_data
#> # A tibble: 5 x 6
#>   id     output  pred1  pred2 pred3_f2 pred3_f3
#>   <fct>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
#> 1 <NA>  -1.66    0.155 -0.484        0        1
#> 2 <NA>  -0.0917 -1.63  -0.375        0        0
#> 3 <NA>   0.0398  0.143 -0.975        1        0
#> 4 <NA>   0.0514  0.612  0.110        0        1
#> 5 <NA>  -0.746  -2.95  -2.93         0        0

Created on 2019-02-10 by the reprex package (v0.2.1)

topepo, Feb 10 '19 21:02

I'm not sure this is a bug. I think the user just wanted to use strings_as_factors = FALSE.
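For reference, a minimal sketch of that option. It assumes a recipes version where prep() still accepts strings_as_factors (newer releases may emit a deprecation warning and prefer setting it in recipe()). The small train and test data frames are made up for illustration and are not the data from the reprex.

```r
library(recipes)

# Made-up data for illustration: the ids in `test` never appear in `train`.
train <- data.frame(id = letters[1:5], pred1 = c(1, 2, 3, 4, 5),
                    stringsAsFactors = FALSE)
test <- data.frame(id = letters[6:10], pred1 = c(6, 7, 8, 9, 10),
                   stringsAsFactors = FALSE)

rec <- recipe(x = train)
rec <- step_center(rec, pred1)

# With strings_as_factors = FALSE, character columns are left as character,
# so no training levels are stored for `id`.
rec_trained <- prep(rec, training = train, strings_as_factors = FALSE)
baked <- bake(rec_trained, new_data = test)

# The ids of the new data should come back unchanged.
baked$id
```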

It looks like when strings_as_factors is TRUE, then in prep() all character columns are coerced to factor and the levels are stored (so this happened to id and the levels a through e were stored).

Then in bake() it checks the test data for any columns that had levels in the training data, and tries to coerce those test data columns to factors using the same levels as the training data. This makes complete sense for most preprocessing steps. For id, though, the values in the test data (letters f through j) were not among the levels stored from the training data, so when the factor is created we just get NA values back (they are treated as "new" levels).
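The coercion described above can be reproduced in base R without recipes. This is only a sketch of the mechanism, not the package's internal code, and the variable names are made up for illustration.

```r
# Ids as they appear in the training rows (1:5) and test rows (6:10).
train_id <- letters[1:5]
test_id <- letters[6:10]

# "prep": convert the training strings to a factor and remember its levels.
train_levels <- levels(factor(train_id))

# "bake": rebuild the test column as a factor using the stored levels.
# Values that are not among those levels become NA.
test_factor <- factor(test_id, levels = train_levels)

test_factor
```

Every test id falls outside the stored levels, which is why the whole id column in the reprex comes back as NA.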

The only thing I can think to do is to only run the strings_as_factors conversion on predictor columns, which may or may not be a smart thing to do. If we did that we would need to be sure to document it.
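To make that idea concrete, here is a rough base R sketch of converting strings to factors only for columns with a predictor role. The helper convert_predictor_strings() and the roles vector are hypothetical and exist only for this illustration; they are not part of recipes.

```r
# Hypothetical helper (not a recipes function): convert character columns
# to factors only when their role is "predictor".
convert_predictor_strings <- function(data, roles) {
  for (col in names(data)) {
    if (is.character(data[[col]]) && identical(roles[[col]], "predictor")) {
      data[[col]] <- factor(data[[col]])
    }
  }
  data
}

# Made-up data: `id` is an identifier, `pred3` is a string predictor.
train <- data.frame(
  id = letters[1:5],
  pred3 = c("f1", "f2", "f3", "f1", "f2"),
  stringsAsFactors = FALSE
)
roles <- c(id = "id variable", pred3 = "predictor")

converted <- convert_predictor_strings(train, roles)

# `id` stays character, `pred3` becomes a factor.
sapply(converted, class)
```

Under this scheme no levels would be stored for id, so new ids at bake() time would pass through untouched.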

DavisVaughan, Feb 11 '19 00:02

This is causing some issues for me using {workflows}; it caught me completely off guard. Is there a way to set strings_as_factors = FALSE globally, or to set it in {recipes}? AFAIK there is no way to pass this down to prep() when using workflows containing an add_recipe() step.

jcpsantiago, Aug 25 '20 12:08

Results with https://github.com/tidymodels/recipes/pull/706

library(dplyr, warn.conflicts = FALSE)
library(recipes, warn.conflicts = FALSE)

data <- tibble(
  id = letters[1:12],
  output = rnorm(12, mean = 0),
  pred1 = rnorm(12, mean = 10),
  pred2 = rnorm(12, mean = 20),
  pred3 = factor(rep(c('f1', 'f2', 'f3'), 4))
)

data$pred1[c(1,6)] <- NA
data$pred2[c(2,7)] <- NA
df_train <- data[1:5,]
df_test <- data[6:10,]

rec_obj <- recipe(x = df_train, strings_as_factors = TRUE) %>%
  update_role(output, new_role = 'outcome') %>%
  update_role(id, new_role = "id variable") %>%
  update_role(-output, -id, new_role = 'predictor') %>%
  step_dummy(pred3) %>%
  step_center(pred1, pred2) %>%
  step_scale(pred1, pred2) %>%
  step_impute_median(all_predictors())

rec_trained <- prep(rec_obj, training = df_train)
train_data    <- bake(rec_trained, new_data = df_train)
test_data     <- bake(rec_trained, new_data = df_test)
test_data
#> # A tibble: 5 × 6
#>   id     output   pred1  pred2 pred3_f2 pred3_f3
#>   <chr>   <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
#> 1 f     -0.234   0.108   0.215        0        1
#> 2 g     -2.45   -0.764  -0.323        0        0
#> 3 h      1.33   -1.39   -1.14         1        0
#> 4 i      0.0191 -3.12    1.50         0        1
#> 5 j      0.253   0.0114 -2.11         0        0

Created on 2025-03-27 with reprex v2.1.1

EmilHvitfeldt, Mar 27 '25 18:03

Closed via https://github.com/tidymodels/recipes/pull/706

EmilHvitfeldt, Apr 03 '25 00:04

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

github-actions[bot], Apr 17 '25 00:04