More details of formula usage in mgcv engine docs when using workflow

Open qiushiyan opened this issue 3 years ago • 2 comments

We need to include more details about using a GAM formula in the engine doc for gen_additive_mod(engine = "mgcv"). The engine doc only shows model-fitting examples that pass the GAM formula to fit() directly. When using a workflow with a recipe, the GAM formula needs to be declared in add_model() alongside the model spec:

library(tidymodels)

# no inline functions in the recipe formula
rec <- recipe(formula = mpg ~ ., data = mtcars)

# a mode must be set before the model can be fitted
spec <- gen_additive_mod() %>%
    set_engine("mgcv") %>%
    set_mode("regression")

wf <- workflow() %>%
    add_recipe(rec) %>%
    add_model(spec, formula = mpg ~ wt + gear + cyl + s(disp, k = 10))  # use the GAM formula here
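
As a minimal sketch of how such a workflow can then be fitted and used for prediction (assuming the tidymodels and mgcv packages are installed; the names wf_fit and preds are illustrative only), the full round trip might look like this:

```r
library(tidymodels)

# The recipe formula only assigns roles; it contains no smooth terms.
rec <- recipe(mpg ~ ., data = mtcars)

# The mode is set explicitly because GAMs support more than one mode.
spec <- gen_additive_mod() %>%
  set_engine("mgcv") %>%
  set_mode("regression")

# The GAM formula, including the smooth s(), goes into add_model().
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec, formula = mpg ~ wt + gear + cyl + s(disp, k = 10))

# Fit on the full data and predict for the first few rows.
wf_fit <- fit(wf, data = mtcars)
preds <- predict(wf_fit, new_data = head(mtcars))
preds
```

The recipe decides which columns are available to the model, while the formula given to add_model() decides how mgcv uses them.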

qiushiyan avatar Jul 15 '22 19:07 qiushiyan

A relevant Community post with reprex: https://community.rstudio.com/t/error-in-fit-xy-with-gam-model/143065

simonpcouch avatar Jul 26 '22 17:07 simonpcouch

+1

Steviey avatar Aug 12 '22 09:08 Steviey

Assume we have a response variable, outcome, one numerical predictor, pred_num, and one categorical predictor, prec_fac.

Assume the GAM formula is:

gam_formula <- "outcome ~ ." |> as.formula()

Then, you preprocess it through recipes with:

data_recipe <- recipes::recipe(
  formula = gam_formula,
  data    = data_train
) |>
  recipes::step_dummy(prec_fac)
  # Other Steps ... (pipe them on after step_dummy())

# Train the recipe
data_recipe_prep <- data_recipe |>
  recipes::prep(training = data_train)

# Apply to training data
data_train_prep <- data_recipe_prep |>
  recipes::bake(new_data = NULL)

# Apply to test data
data_test_prep <- data_recipe_prep |>
  recipes::bake(new_data = data_test)

For things to work elsewhere, say in tune::tune_grid(), you need to add the following to workflows::add_model():

formula_alt = gam_formula |> terms.formula(data = data_train_prep)

So, whenever the model formula involves categorical variables, you need to preprocess the data manually and build the model formula from the terms of the preprocessed data.
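
To illustrate what the terms.formula() call does here, the following base R sketch expands the dot in a formula against a small made-up data frame. The data frame toy_prep and its dummy column prec_fac_b are invented for illustration and merely stand in for the baked training data.

```r
# Hypothetical stand-in for baked training data with one dummy column.
toy_prep <- data.frame(
  outcome    = c(1.0, 2.0, 3.0),
  pred_num   = c(0.1, 0.2, 0.3),
  prec_fac_b = c(0, 1, 0)
)

gam_formula <- as.formula("outcome ~ .")

# Supplying `data` lets terms.formula() replace "." with the actual columns.
expanded <- terms.formula(gam_formula, data = toy_prep)

# The predictors the expanded formula now refers to.
term_labels <- attr(expanded, "term.labels")
term_labels  # "pred_num" "prec_fac_b"
```

The expanded terms name the dummy columns created by the recipe rather than the original factor, which is why the formula has to be built from the preprocessed data.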

This switching between formulae is very confusing and could cause serious inconsistencies. Where do you use gam_formula versus formula_alt, and how would it affect a complex workflow? I hope this gets addressed soon.

siavash-babaei avatar Mar 07 '23 23:03 siavash-babaei

This may be a workflows or hardhat change rather than a parsnip one, but it might be worth looking for indicative input in add_formula() or add_recipe() and warning if the formula looks like it needs to be passed as a model formula but add_model(formula) is missing. This is a bit tough since the add_*() functions should be callable in either order, so maybe the check waits until fit.workflow() is triggered.
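
One way such a check could look, purely as a sketch: the helper has_smooth_terms() below is hypothetical and not part of workflows, hardhat, or parsnip. It flags a formula that calls one of mgcv's smooth constructors (s, te, ti, t2), which would be the kind of indicative input described above.

```r
# Hypothetical helper: TRUE if the formula calls an mgcv smooth constructor.
has_smooth_terms <- function(formula, smooths = c("s", "te", "ti", "t2")) {
  # all.names() returns functions and variables; dropping the variables
  # leaves only the function names (operators included).
  called <- setdiff(all.names(formula), all.vars(formula))
  any(called %in% smooths)
}

with_smooth    <- has_smooth_terms(mpg ~ wt + s(disp, k = 10))  # TRUE
without_smooth <- has_smooth_terms(mpg ~ wt + disp)             # FALSE
```

A check like this only inspects the formula itself, so it could run at fit time regardless of the order in which the add_*() functions were called.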

simonpcouch avatar Jun 28 '23 14:06 simonpcouch

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions[bot] avatar Nov 21 '23 00:11 github-actions[bot]