recipes requires step_dummy while the parsnip package doesn't
library(tidymodels)
x <- mtcars %>% mutate(y = factor(carb > 3), gear = factor(gear))
rec <- recipe(y ~ gear, data = x)
lr_mod <- logistic_reg(penalty = 1, mixture = 0) %>% set_engine("glmnet")
lr_wf <- workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(rec)
lr_wf %>% fit(data = x)
#> Error in `maybe_matrix()`:
#> ! Some columns are non-numeric. The data cannot be converted to numeric matrix: 'gear'.
lr_mod %>% fit(y ~ gear, data = x)
#> parsnip model object
#>
#>
#> Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial", alpha = ~0)
The parsnip model lr_mod automatically converts factors to dummy variables (one of the selling points of parsnip), but why doesn't the workflow do so?
My actual data is 1000 columns of binary data (not great), which I imported as factors. It worked fine in parsnip but workflows required step_dummy which takes a while.
Hello!
This is not solely a recipes issue per se. What you are seeing is the difference in how {workflows} and {parsnip} handle things depending on whether a recipe is supplied.
If no recipe is supplied, models that need numeric predictors will try to accommodate that themselves.
BUT if you supply a recipe, you are taking on the responsibility of defining the preprocessing steps needed for your model to work. This includes deciding how to handle categorical predictors: dummy encoding, hash encoding, likelihood encoding and so on. This is not something we are likely to change, as we feel it is the right choice.
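As a minimal sketch of what "taking on the responsibility" looks like for the reprex above: adding step_dummy() to the recipe gives glmnet the numeric columns it needs. This reuses the names from the original post; it is one possible encoding choice, not the only valid one.

```r
library(tidymodels)

x <- mtcars %>% mutate(y = factor(carb > 3), gear = factor(gear))

# The recipe now states explicitly how factor predictors are encoded
rec <- recipe(y ~ gear, data = x) %>%
  step_dummy(all_nominal_predictors())

lr_mod <- logistic_reg(penalty = 1, mixture = 0) %>% set_engine("glmnet")

lr_wf <- workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(rec)

# With the dummy step in place, the predictors reach glmnet as numeric columns
lr_fit <- lr_wf %>% fit(data = x)
```

Swapping step_dummy() for another encoding step is then a one-line change in the recipe, which is the flexibility the explicit approach is meant to give.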
> My actual data is 1000 columns of binary data (not great), which I imported as factors. It worked fine in parsnip but workflows required step_dummy which takes a while.
If you have 1000 binary columns, you should import them as logical or integer (0 and 1) variables, since that is what parsnip or recipes would create anyway.
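A rough sketch of that suggestion, assuming the binary columns are already in memory as two-level factors with levels "A" and "B" (made-up data, only 10 predictors to keep it small): convert them to 0/1 integers once, up front, and the recipe no longer needs step_dummy().

```r
library(tidymodels)

set.seed(1)

# Hypothetical stand-in for the real data: an outcome plus 10 binary factors
make_factor <- function(i) {
  factor(sample(c("A", "B"), 100, TRUE), levels = c("A", "B"))
}

x <- map(1:11, make_factor) %>%
  set_names(c("outcome", paste0("x", 1:10))) %>%
  as_tibble()

# Encode every predictor as 0/1 (1 = second level, "B"); the outcome stays a factor
x_int <- x %>%
  mutate(across(-outcome, ~ as.integer(.x == "B")))

# No step_dummy() needed: all predictors are already numeric
rec <- recipe(outcome ~ ., data = x_int)

lr_mod <- logistic_reg(penalty = 1, mixture = 0) %>% set_engine("glmnet")

fit_int <- workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(rec) %>%
  fit(data = x_int)
```

Coding the second level as 1 matches what R's default treatment contrasts would produce for a two-level factor, so the resulting column should be the same as the dummy variable parsnip's formula interface creates.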
That leaves the question of why recipes is so much slower than parsnip (which uses model.frame()) here:
library(tidymodels)
make_factor <- function(x) {
  factor(sample(c("A", "B"), 100, TRUE), levels = c("A", "B"))
}
x <- map(1:1001, make_factor) %>%
  set_names(c("outcome", paste0("x", 1:1000))) %>%
  as_tibble()
rec <- recipe(outcome ~ ., data = x) %>%
  step_dummy(all_nominal_predictors())
lr_mod <- logistic_reg(penalty = 1, mixture = 0) %>% set_engine("glmnet")
lr_wf <- workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(rec)
tictoc::tic("with recipes")
tmp <- lr_wf %>% fit(data = x)
tictoc::toc()
#> with recipes: 5.146 sec elapsed
tictoc::tic("without recipes")
tmp <- lr_mod %>% fit(outcome ~ ., data = x)
tictoc::toc()
#> without recipes: 0.143 sec elapsed
Created on 2023-11-01 with reprex v2.0.2
I'm closing this issue in favor of https://github.com/tidymodels/recipes/issues/1305
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.