recipes
id fields converted to NA
From this post.
library(dplyr, warn.conflicts = FALSE)
library(recipes, warn.conflicts = FALSE)
data <- tibble(
  id = letters[1:12],
  output = rnorm(12, mean = 0),
  pred1 = rnorm(12, mean = 10),
  pred2 = rnorm(12, mean = 20),
  pred3 = factor(rep(c('f1', 'f2', 'f3'), 4))
)
data$pred1[c(1,6)] <- NA
data$pred2[c(2,7)] <- NA
df_train <- data[1:5,]
df_test <- data[6:10,]
rec_obj <- recipe(x = df_train) %>%
  update_role(output, new_role = 'outcome') %>%
  update_role(id, new_role = "id variable") %>%
  update_role(-output, -id, new_role = 'predictor') %>%
  step_dummy(pred3) %>%
  step_center(pred1, pred2) %>%
  step_scale(pred1, pred2) %>%
  step_medianimpute(all_predictors())
rec_trained <- prep(rec_obj, training = df_train)
train_data <- bake(rec_trained, new_data = df_train)
test_data <- bake(rec_trained, new_data = df_test)
test_data
#> # A tibble: 5 x 6
#>   id     output  pred1  pred2 pred3_f2 pred3_f3
#>   <fct>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
#> 1 <NA>  -1.66    0.155 -0.484        0        1
#> 2 <NA>  -0.0917 -1.63  -0.375        0        0
#> 3 <NA>   0.0398  0.143 -0.975        1        0
#> 4 <NA>   0.0514  0.612  0.110        0        1
#> 5 <NA>  -0.746  -2.95  -2.93         0        0
Created on 2019-02-10 by the reprex package (v0.2.1)
I'm not sure this is a bug. I think the user just wanted to use strings_as_factors = FALSE.
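For reference, a minimal sketch of that workaround, assuming a recipes version where prep() still accepts strings_as_factors (the later reprex in this thread passes it to recipe() instead). The reduced data set and the set.seed() call are mine, not from the original post, and step_center() on a single column stands in for the full pipeline:

```r
library(recipes, warn.conflicts = FALSE)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
data <- data.frame(
  id = letters[1:10],
  output = rnorm(10),
  pred1 = rnorm(10, mean = 10),
  stringsAsFactors = FALSE
)
df_train <- data[1:5, ]
df_test <- data[6:10, ]

rec_obj <- recipe(output ~ ., data = df_train) %>%
  update_role(id, new_role = "id variable") %>%
  step_center(pred1)

# strings_as_factors = FALSE leaves character columns (including id) as character,
# so no training levels are stored for id and nothing is turned into NA in bake()
rec_trained <- prep(rec_obj, training = df_train, strings_as_factors = FALSE)
test_data <- bake(rec_trained, new_data = df_test)
test_data$id
```

With this setting the id column should come back from bake() as a plain character vector rather than a factor.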
It looks like when strings_as_factors is TRUE, prep() coerces all character columns to factors and stores their levels (so this happened to id, and the levels a through e were stored).
Then bake() checks the test data for any columns that had levels in the training data and coerces those columns to factors using the training levels. This makes complete sense for most preprocessing steps. For id, though, the values in the test data (letters f through j) were not present in the training data, so when the factor is created they are treated as "new" levels and we just get NA values back.
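The NA behaviour can be reproduced in base R without recipes at all. This is only an illustration of the underlying factor() semantics, not the actual code inside prep() or bake(); the variable names are made up for the example:

```r
# Levels learned from the training ids (rows 1:5 of the reprex)
train_id <- letters[1:5]
train_levels <- sort(unique(train_id))

# Test ids (rows 6:10) share no values with the training ids
test_id <- letters[6:10]

# Re-encoding the test ids with the training levels turns every unseen value into NA
test_id_fct <- factor(test_id, levels = train_levels)
test_id_fct
all(is.na(test_id_fct))
```

Every test id is outside the stored levels, which is why the whole id column comes back as NA in the baked test data.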
The only thing I can think to do is to only run the strings_as_factors conversion on predictor columns, which may or may not be a smart thing to do. If we did that we would need to be sure to document it.
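A rough base R sketch of what "only convert predictor columns" could look like. The helper convert_predictor_strings() and the roles vector are hypothetical and exist only for this illustration; this is not how recipes stores roles or performs the conversion:

```r
# Hypothetical helper: convert character columns to factors only when their
# role is "predictor"; columns with any other role (such as id) are left as-is.
convert_predictor_strings <- function(df, roles) {
  for (col in names(df)) {
    if (is.character(df[[col]]) && identical(roles[[col]], "predictor")) {
      df[[col]] <- factor(df[[col]])
    }
  }
  df
}

df <- data.frame(
  id = c("a", "b", "c"),
  pred3 = c("f1", "f2", "f3"),
  stringsAsFactors = FALSE
)

# Assumed role lookup: a named character vector, one entry per column
roles <- c(id = "id variable", pred3 = "predictor")

converted <- convert_predictor_strings(df, roles)
sapply(converted, class)
```

Under this scheme id stays character, so unseen ids in new data would pass through untouched, while pred3 still gets fixed levels from the training data.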
This is causing some issues for me when using {workflows}; it caught me completely off guard. Is there a way to set strings_as_factors = FALSE globally, or to set it in {recipes}? AFAIK there is no way to pass this down to prep() when using a workflow that contains an add_recipe() step.
Results with https://github.com/tidymodels/recipes/pull/706
library(dplyr, warn.conflicts = FALSE)
library(recipes, warn.conflicts = FALSE)
data <- tibble(
  id = letters[1:12],
  output = rnorm(12, mean = 0),
  pred1 = rnorm(12, mean = 10),
  pred2 = rnorm(12, mean = 20),
  pred3 = factor(rep(c('f1', 'f2', 'f3'), 4))
)
data$pred1[c(1,6)] <- NA
data$pred2[c(2,7)] <- NA
df_train <- data[1:5,]
df_test <- data[6:10,]
rec_obj <- recipe(x = df_train, strings_as_factors = TRUE) %>%
  update_role(output, new_role = 'outcome') %>%
  update_role(id, new_role = "id variable") %>%
  update_role(-output, -id, new_role = 'predictor') %>%
  step_dummy(pred3) %>%
  step_center(pred1, pred2) %>%
  step_scale(pred1, pred2) %>%
  step_impute_median(all_predictors())
rec_trained <- prep(rec_obj, training = df_train)
train_data <- bake(rec_trained, new_data = df_train)
test_data <- bake(rec_trained, new_data = df_test)
test_data
#> # A tibble: 5 × 6
#>   id     output   pred1  pred2 pred3_f2 pred3_f3
#>   <chr>   <dbl>   <dbl>  <dbl>    <dbl>    <dbl>
#> 1 f     -0.234   0.108   0.215        0        1
#> 2 g     -2.45   -0.764  -0.323        0        0
#> 3 h      1.33   -1.39   -1.14         1        0
#> 4 i      0.0191 -3.12    1.50         0        1
#> 5 j      0.253   0.0114 -2.11         0        0
Created on 2025-03-27 with reprex v2.1.1
Closed via https://github.com/tidymodels/recipes/pull/706
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.