
Recipes requires step_dummy while parsnip package doesn't

Open jxu opened this issue 2 years ago • 2 comments

library(tidymodels)

x <- mtcars %>% mutate(y = factor(carb > 3), gear = factor(gear))
rec <- recipe(y ~ gear, data = x) 

lr_mod <- logistic_reg(penalty = 1, mixture = 0) %>% set_engine("glmnet")
lr_wf <- workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(rec)

lr_wf %>% fit(data = x)
#> Error in `maybe_matrix()`:
#> ! Some columns are non-numeric. The data cannot be converted to numeric matrix: 'gear'.

lr_mod %>% fit(y ~ gear, data = x)
#> parsnip model object
#> 
#> 
#> Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial",      alpha = ~0) 

The parsnip model lr_mod automatically converts factors to dummy variables (one of the selling points of parsnip), but why doesn't the workflow do so?

My actual data is 1000 columns of binary data (not great), which I imported as factors. It worked fine in parsnip but workflows required step_dummy which takes a while.

jxu avatar Nov 01 '23 20:11 jxu

Hello!

This is not solely a recipes issue. What you are seeing is a difference in how {workflows} and {parsnip} handle predictors, depending on whether a recipe is supplied.

Without a recipe, models that need numeric predictors will try to accommodate that by converting factors for you.

BUT if you supply a recipe, you are taking on the responsibility of defining the preprocessing steps needed for your model to work. This includes deciding how to handle categorical predictors: dummy encoding, hashing encoding, likelihood encoding, and so on. This is not something we are likely to change, as we feel that is the right choice.
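For the example in the original post, a minimal sketch of taking on that responsibility is to add a dummy-encoding step to the recipe. It only preps and bakes the recipe to show that the predictors come out numeric; the name `baked` is illustrative and not from the thread.

```r
library(dplyr)
library(recipes)

# Same data as in the original post.
x <- mtcars %>% mutate(y = factor(carb > 3), gear = factor(gear))

# Declare the categorical encoding explicitly in the recipe.
rec <- recipe(y ~ gear, data = x) %>%
  step_dummy(all_nominal_predictors())

# Estimate the recipe and return the processed training data.
baked <- rec %>%
  prep() %>%
  bake(new_data = NULL)

# `gear` is replaced by numeric indicator columns; `y` stays a factor.
names(baked)
```

A recipe like this can then be passed to `add_recipe()` in the workflow from the original post, so that glmnet receives only numeric predictors.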

> My actual data is 1000 columns of binary data (not great), which I imported as factors. It worked fine in parsnip but workflows required step_dummy which takes a while.

If you have 1000 binary columns, you should import them as logical or integer (0 and 1) variables, since that is what parsnip or recipes would create anyway.
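As a rough illustration in base R, the indicator column that a formula interface builds for a two-level factor carries the same information as a plain 0/1 integer. The helper `to_binary_int()` below is hypothetical (it is not part of parsnip or recipes) and assumes every factor predictor has exactly two levels.

```r
set.seed(123)
f <- factor(sample(c("A", "B"), 100, TRUE), levels = c("A", "B"))

# Treatment-contrast indicator that model.matrix() creates for `f`.
dummy_col <- unname(model.matrix(~ f)[, "fB"])

# The same information stored directly as a 0/1 integer.
int_col <- as.integer(f == "B")

same_values <- identical(as.integer(dummy_col), int_col)

# Hypothetical helper: convert every factor predictor to 0/1 integers,
# leaving the outcome column untouched (it must stay a factor for
# classification). Assumes each factor predictor has two levels.
to_binary_int <- function(df, outcome) {
  is_pred_factor <- vapply(df, is.factor, logical(1)) & names(df) != outcome
  df[is_pred_factor] <- lapply(
    df[is_pred_factor],
    function(col) as.integer(col == levels(col)[2])
  )
  df
}

df <- data.frame(outcome = f, x1 = rev(f))
converted <- to_binary_int(df, outcome = "outcome")
```

With predictors stored this way, a recipe needs no `step_dummy()` for them, since they are already numeric.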

EmilHvitfeldt avatar Nov 01 '23 22:11 EmilHvitfeldt

There is a separate question here: why is recipes so much slower than parsnip (which uses model.frame())?

library(tidymodels)

make_factor <- function(x) {
  factor(sample(c("A", "B"), 100, TRUE), levels = c("A", "B"))
} 

x <- map(1:1001, make_factor) %>%
  set_names(c("outcome", paste0("x", 1:1000))) %>%
  as_tibble()

rec <- recipe(outcome ~ ., data = x) %>%
  step_dummy(all_nominal_predictors())

lr_mod <- logistic_reg(penalty = 1, mixture = 0) %>% set_engine("glmnet")

lr_wf <- workflow() %>%
  add_model(lr_mod) %>%
  add_recipe(rec)

tictoc::tic("with recipes")
tmp <- lr_wf %>% fit(data = x)
tictoc::toc()
#> with recipes: 5.146 sec elapsed

tictoc::tic("without recipes")
tmp <- lr_mod %>% fit(outcome ~ ., data = x)
tictoc::toc()
#> without recipes: 0.143 sec elapsed

Created on 2023-11-01 with reprex v2.0.2
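If the goal is to keep a workflow but get formula-style encoding of factors, one option that may be worth trying is `add_formula()` in place of `add_recipe()`. The sketch below uses a smaller simulated data set (20 predictors rather than 1000) and assumes glmnet is installed. It was not timed, so it makes no claim about speed relative to the two fits above.

```r
library(tidymodels)

set.seed(2023)

# Same simulation idea as above, scaled down to 20 predictors.
make_factor <- function(i) {
  factor(sample(c("A", "B"), 100, TRUE), levels = c("A", "B"))
}

x <- map(1:21, make_factor) %>%
  set_names(c("outcome", paste0("x", 1:20))) %>%
  as_tibble()

lr_mod <- logistic_reg(penalty = 1, mixture = 0) %>% set_engine("glmnet")

# A formula preprocessor instead of a recipe: no step_dummy() is declared,
# and the factor predictors are left for the formula handling to encode.
lr_wf_formula <- workflow() %>%
  add_model(lr_mod) %>%
  add_formula(outcome ~ .)

wf_fit <- lr_wf_formula %>% fit(data = x)
preds <- predict(wf_fit, new_data = x)
```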

EmilHvitfeldt avatar Nov 01 '23 22:11 EmilHvitfeldt

I'm closing this issue in favor of https://github.com/tidymodels/recipes/issues/1305

EmilHvitfeldt avatar May 23 '24 16:05 EmilHvitfeldt

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

github-actions[bot] avatar Jun 07 '24 00:06 github-actions[bot]