parsnip
More details of formula usage in mgcv engine docs when using workflow
We need to include more details about using a GAM formula in the engine doc for gen_additive_mod(engine = "mgcv"). The engine doc only shows model fitting examples where the GAM formula is passed to fit() directly. When using a workflow with a recipe, the GAM formula needs to be declared in add_model() alongside the model spec:
library(tidymodels)

# no inline function in recipe
rec <- recipe(formula = mpg ~ ., data = mtcars)
spec <- gen_additive_mod() %>%
  set_engine("mgcv") %>%
  set_mode("regression")
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec, formula = mpg ~ wt + gear + cyl + s(disp, k = 10)) # use gam formula here
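As a minimal sketch of the pattern above, assuming the parsnip, recipes, workflows, and mgcv packages are installed, the workflow can be fitted and used for prediction. The explicit set_mode("regression") call and the names wf_fit and preds are additions for illustration, not part of the original report.

```r
library(parsnip)
library(recipes)
library(workflows)

# The recipe only declares roles: no smooth terms are allowed here
rec <- recipe(mpg ~ ., data = mtcars)

spec <- gen_additive_mod() %>%
  set_engine("mgcv") %>%
  set_mode("regression")  # assumption: the mode has to be set before fitting

# The GAM formula with the smooth term goes into add_model()
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(spec, formula = mpg ~ wt + gear + cyl + s(disp, k = 10))

wf_fit <- fit(wf, data = mtcars)
preds <- predict(wf_fit, new_data = head(mtcars))
preds
```

The recipe decides which columns reach the model, while the formula given to add_model() decides how mgcv uses them.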
A relevant Community post with reprex: https://community.rstudio.com/t/error-in-fit-xy-with-gam-model/143065
+1
Assume we have a response variable, outcome, one numerical predictor, pred_num, and one categorical variable, prec_fac.
Assume the GAM formula is:
gam_formula <- "outcome ~ ." |> as.formula()
Then, you preprocess it through recipes with:
data_recipe <- recipes::recipe(
  formula = gam_formula,
  data = data_train
) |>
  recipes::step_dummy(prec_fac)
  # Other Steps ...

# Train the recipe
data_recipe_prep <- data_recipe |>
  recipes::prep(training = data_train)

# Apply to training data
data_train_prep <- data_recipe_prep |>
  recipes::bake(new_data = NULL)

# Apply to test data
data_test_prep <- data_recipe_prep |>
  recipes::bake(new_data = data_test)
For things to work elsewhere, say in tune::tune_grid(), you need to add the following to workflows::add_model():
formula_alt = gam_formula |> terms.formula(data = data_train_prep)
So, whenever the model formula contains categorical variables, you need to preprocess the data manually and build the model formula from the terms of the preprocessed data.
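To illustrate what the terms.formula() call does, here is a small base R sketch. The data frame baked_like is a hypothetical stand-in for data_train_prep, and the dummy column names prec_fac_b and prec_fac_c are assumed for illustration only.

```r
# Hypothetical stand-in for the baked training data: the factor prec_fac
# has already been replaced by dummy columns (names are assumptions)
baked_like <- data.frame(
  outcome    = c(1.2, 3.4, 2.2, 4.1),
  pred_num   = c(0.5, 1.5, 2.5, 3.5),
  prec_fac_b = c(0, 1, 0, 0),
  prec_fac_c = c(0, 0, 1, 0)
)

gam_formula <- outcome ~ .

# terms.formula() expands the dot against the columns of `data`,
# leaving out the response
formula_alt <- terms.formula(gam_formula, data = baked_like)
term_labels <- attr(formula_alt, "term.labels")
term_labels
```

The expanded terms name the dummy columns rather than the original factor, which is why the formula passed to add_model() can differ from the one given to the recipe.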
This change of formulae in particular is very confusing and could cause serious inconsistencies. Where do you use gam_formula vs. formula_alt, and how would it affect a complex workflow? I hope this gets addressed soon.
This may be a workflows or hardhat change rather than a parsnip one, but it might be worth looking for indicative input in add_formula() or add_recipe() and warning if the formula looks like it needs to be passed as a model formula but add_model(formula) is missing. This is a bit tough since the add_*() functions should be callable in either order, so the check may have to wait until fit.workflow() is triggered.
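One way such a check might look, purely as a sketch: the helpers formula_calls() and looks_like_gam_formula() below are hypothetical and are not part of workflows, hardhat, or parsnip. They walk the formula and report whether it calls one of the mgcv smooth constructors s(), te(), ti(), or t2().

```r
# Hypothetical helper: collect the names of all functions called in an expression
formula_calls <- function(expr) {
  if (!is.call(expr)) return(character())
  fn <- expr[[1]]
  fn_name <- if (is.name(fn)) as.character(fn) else character()
  # Recurse into the arguments (every element after the function itself)
  arg_calls <- lapply(seq_along(expr)[-1], function(i) formula_calls(expr[[i]]))
  c(fn_name, unlist(arg_calls))
}

# Hypothetical heuristic: does the formula use an mgcv smooth constructor?
looks_like_gam_formula <- function(f, smooth_fns = c("s", "te", "ti", "t2")) {
  any(formula_calls(f) %in% smooth_fns)
}

with_smooth <- looks_like_gam_formula(mpg ~ wt + s(disp, k = 10))
without_smooth <- looks_like_gam_formula(mpg ~ .)
```

This only detects smooth terms written as plain calls. Namespaced calls such as mgcv::s(disp) would need extra handling, and a real check would also have to decide when in the workflow lifecycle to run.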