workflows
Support for eval(validation) data
The following problem arose when one of the preprocessing steps was `embed::step_lencode_glm` (generalized target encoding) and the model was xgboost.
From the documentation of `parsnip::xgb_train`, it appears that separate evaluation data cannot be supplied for early stopping. While the `validation` argument sets aside some validation (eval) data for early stopping, it's not clear whether the recipe is applied before or after splitting the train and validation parts. How does this work?
It might be a good idea to support something like this:
```r
# case 1: the user specifies train and eval data directly
workflow() %>%
  add_recipe(some_recipe) %>%
  add_model(some_model) %>%
  fit(train_data = A, eval_data = B, use_eval_in_early_stopping = TRUE)

# case 2: use an existing 'initial_split' class object
splitter <- initial_split(dataset, 0.7)
workflow() %>%
  add_recipe(some_recipe) %>%
  add_model(some_model) %>%
  fit(splitter, use_eval_in_early_stopping = TRUE)
```
where
- the recipe is always prepped (trained) on the train part and baked (applied) on the eval (validation) part
- the eval data is used for early stopping if the algorithm supports it and the flag is set to `TRUE`
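As a toy base-R sketch of the intended semantics (no tidymodels; a simple centering step stands in for a real recipe, and the data are made up):

```r
# Toy sketch of the proposed flow: split first, learn the
# preprocessing on the train part only, then apply it to both parts.
dataset <- data.frame(x = c(1, 2, 3, 4, 100))  # last row is an outlier

# 1. split first: train = rows 1-4, eval = row 5
train_part <- dataset[1:4, , drop = FALSE]
eval_part  <- dataset[5,  , drop = FALSE]

# 2. "prep": learn the centering constant on the train part only
center <- mean(train_part$x)   # 2.5; the eval row never influences it

# 3. "bake": apply the same learned constant to both parts
train_part$x_c <- train_part$x - center
eval_part$x_c  <- eval_part$x - center
```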
Thanks for the issue!
> From the documentation of `parsnip::xgb_train` it appears that evaluation data cannot be used for early stopping. While the argument `validation` sets aside some validation (eval) data for early stopping, it's not clear if the recipe is applied after splitting the train and validation parts. How does this work?
While fitting that workflow on some training set A, the recipe is first prepped on and applied to all of A; the result, bake(A), is then passed to `parsnip::xgb_train()`. The `validation` argument in that function can then be used to allot some of bake(A) for use in a watchlist (i.e. as a validation set, possibly for early stopping).
> eval data to be used in early stopping if the algorithm supports it and the flag is set to true.
The functionality that (I believe) the proposed `use_eval_in_early_stopping` argument would implement is already possible by passing `validation` and setting a non-`NULL` `early_stop` argument:
```r
some_model <-
  boost_tree() %>%
  set_engine("xgboost", validation = .2, early_stop = 10)
```
If this doesn't cover your use case, could you please provide a minimal reprex (reproducible example) demonstrating the functionality you'd like to see? Could you also clarify your notation "eval(validation)" and "validation(eval)"?
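For reference, a sketch of how those engine arguments might sit inside a full workflow (`my_recipe` and `my_train` are placeholder names, not from the original thread):

```r
library(tidymodels)

# `validation` and `early_stop` are engine arguments forwarded to
# parsnip::xgb_train(); `my_recipe` and `my_train` are placeholders
xgb_spec <-
  boost_tree(trees = 500) %>%
  set_engine("xgboost", validation = .2, early_stop = 10) %>%
  set_mode("regression")

wf <-
  workflow() %>%
  add_recipe(my_recipe) %>%
  add_model(xgb_spec)

fitted_wf <- fit(wf, data = my_train)
```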
Simon, thanks for your response.
The problem is right here:
> While fitting that workflow on some training set A, the recipe is first applied and processes all of A, then passes bake(A) to parsnip::xgb_train. The validation argument in that function can then be used to allot some of bake(A) for use in a watchlist (i.e. as a validation set, possibly for early stopping).
Current flow:
```
data (A)
  --> prep (train) the recipe on A
  --> apply the recipe to A to get A_new
  --> split A_new into A_train and A_validation
  --> fit the model on A_train, using A_validation for early stopping
```
Expected flow:
```
data (A)
  --> split A into A_train and A_validation
  --> prep (train) a recipe on A_train only
  --> bake (apply) the prepped recipe on A_train and A_validation to obtain train_new and validation_new
  --> fit the model on train_new, using validation_new for early stopping
```
In the current flow there is a data leak, because the recipe learns from the combined dataset.
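To make the leak concrete, here is a toy base-R illustration (hypothetical data; a plain per-level mean stands in for `step_lencode_glm`):

```r
# toy data: categorical predictor g, numeric target y
A <- data.frame(
  g = c("a", "a", "b", "b", "b"),
  y = c(1, 3, 2, 2, 20)
)
train_rows <- 1:4   # pretend row 5 is later set aside for validation

# current flow: encoding learned on ALL of A (validation row included)
enc_all <- tapply(A$y, A$g, mean)                            # a = 2, b = 8

# expected flow: encoding learned on the train part only
enc_train <- tapply(A$y[train_rows], A$g[train_rows], mean)  # a = 2, b = 2

# the encodings for level "b" differ: the held-out row (y = 20)
# leaked into the statistic the model is trained on
```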
So there is the possibility that the validation set used inside of xgb_train() might lead to optimistic results (due to preprocessing - not the model).
If that is a potential issue for your recipe, I would use validation_split() instead of the validation argument of xgb_train(). That will quarantine the holdout data from both the model and the preprocessor. You can tune `trees`, using this external validation set to determine when to stop boosting.
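As a sketch, that approach might look like the following (`dataset` and `my_recipe` are placeholder names, not from the original thread):

```r
library(tidymodels)

# quarantine one validation set for both the recipe and the model
val <- validation_split(dataset, prop = 0.8)

xgb_spec <-
  boost_tree(trees = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

wf <-
  workflow() %>%
  add_recipe(my_recipe) %>%
  add_model(xgb_spec)

# for each candidate value of `trees`, the recipe is prepped on the
# training part of the split and baked on the held-out part
res <- tune_grid(wf, resamples = val, grid = 20)
select_best(res, metric = "rmse")
```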
The API that you suggest is difficult to implement, since xgb_train() does not have access to the recipe and so could not preprocess that data separately.
Max, thanks for your response.
> You can tune trees to use this external validation set to determine when to stop boosting.
Would you mind sharing some example code showing how to achieve this?
PS: tidymodels offers a great system to build and use models.
I would like to keep the workflow clean and avoid calling xgboost::xgb.train directly with validation data in the watchlist parameter, which takes me out of the tidymodels paradigm.