Tune xgboost models with early stopping to predict shelter animal status | Julia Silge
Early stopping can keep an xgboost model from overfitting.
Hi Julia, thank you for this video. Very helpful! I was wondering if there was a way to see what specific arguments are available when declaring a computational engine. I was trying to find some info about it on the tidymodels website but I couldn't find anything. Thank you!
Hello Dr. Silge, thanks for the analysis. Are you by chance using the dev version of parsnip? I keep getting an error “could not find function stop_iter” even when running your code as is.
Thanks.
@luisdominguezromero We recently revamped the parsnip documentation to try to surface this information better. For example, take a look at the main landing page for boost_tree(), which has links for the different engines.
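As a quick illustration, you can also query this from R itself; a minimal sketch (only parsnip is assumed to be loaded):

library(parsnip)
# List the engines registered for boosted trees:
show_engines("boost_tree")
# Engine-specific arguments are covered on the engine detail pages,
# e.g. ?details_boost_tree_xgboost for the xgboost engine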
@gunnergalactico Ah, you don't need anything but CRAN parsnip, but you do need the GitHub version of dials for the stop_iter() parameter. Sorry about that!! I have got to start adding session info to my blog posts. 🙈
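For reference, a minimal sketch of installing the development version of dials (assuming you have the remotes package installed):

# Install the GitHub (development) version of dials:
remotes::install_github("tidymodels/dials")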
Hi Julia, if I want to predict on the test set (the Kaggle dataset) after last_fit(), how is stopping_fit used? Is it stopping_fit that should be saved to .rds? Thank you very much!
@AleLustosa The object you would want to use for predicting on new data is extract_workflow(stopping_fit) (that is a fitted workflow), so you could store that as something like stopping_fitted_wf and then save to .rds.
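A minimal sketch of that workflow (stopping_fit is the last_fit() result from the post; the file name and kaggle_test are placeholders):

library(tune)  # extract_workflow() works on last_fit() results
# Pull out the fitted workflow and save it for later use:
stopping_fitted_wf <- extract_workflow(stopping_fit)
saveRDS(stopping_fitted_wf, "stopping_fitted_wf.rds")
# Later, read it back in and predict on new data:
fitted_wf <- readRDS("stopping_fitted_wf.rds")
predict(fitted_wf, new_data = kaggle_test)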
Hi Julia, thanks for your tutorials, they are very helpful. I want to do something like add step_holiday() to a recipe (for Christmas) and then add step_lag() based on this holiday predictor. From my retail experience, people usually organize presents a week or two before the holiday. How would you do this with the recipes package? (Basically, lag the Christmas variable.)
I could do this with a normal data transformation, but I'm wondering whether it's possible to manipulate variables that were created during the recipe (in the step pipeline).
Thanks in advance, sewe
I believe you should be able to use step_lag() with any new variables you create from step_holiday(). If you end up having trouble, I recommend that you create a reprex (a minimal reproducible example) showing what problems you run into. The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it. A good place to ask questions like that is RStudio Community.
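A minimal sketch of that idea (retail_data, sales, and date are hypothetical, and the lagged column name follows step_holiday()'s naming convention):

library(recipes)
rec <- recipe(sales ~ date, data = retail_data) |>
  # Create a 0/1 indicator column named date_ChristmasDay:
  step_holiday(date, holidays = "ChristmasDay") |>
  # Lag that indicator by 7 and 14 days to flag the weeks before the
  # holiday (step_lag() assumes the rows are ordered by date):
  step_lag(date_ChristmasDay, lag = c(7, 14))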
Hi Julia, Thanks for sharing the tutorial. Could you explain why you chose the best parameters based on "mn_log_loss", but evaluated the model performance in terms of "accuracy" and "roc_auc"?
@youngjin-lee No particular reason; you can pass in a custom metric set to last_fit() with the metrics argument to set which metrics to use for the testing set.
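For example, a minimal sketch (final_wf and shelter_split stand in for the finalized workflow and initial split from the post):

library(tune)
library(yardstick)
stopping_fit <- last_fit(
  final_wf,
  shelter_split,
  # Use log loss alongside accuracy and ROC AUC on the testing set:
  metrics = metric_set(mn_log_loss, accuracy, roc_auc)
)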
Hi Julia,
A few questions for you:
-Is it possible to plot the tree itself with tidymodels?
-I'm trying to use the vip package to get the variable importance scores, but running into this error with the vi function:
Error in eval(stats::getCall(object)$data) : object 'x' not found
However, the plot itself functions just fine. Have you run into this at all?
-Due to a peculiar circumstance, I don't need to split my data into training and testing. Do you have any advice on how to train the model without splitting?
Thanks so much for all you contribute to the R community! Tidymodels and your tutorials have been a huge help for me!
- An xgboost model is a boosted tree model, so it doesn't really make sense to plot "the tree" (there isn't a single tree). If you train a single decision tree, then you can plot it (for example, with the rpart.plot package).
- I haven't had that problem, but if you can create a reprex (a minimal reproducible example) for this, you can share it somewhere like RStudio Community and get help.
- For a powerful model like xgboost, you pretty much always need separate training and testing sets. If what you mean is that you already have defined training and testing sets (so you don't need to split), then you can manually create a split using the development version of rsample, as sketched below.
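A minimal sketch of that last point, assuming train_df and test_df are your predefined sets and that your rsample version supports the two-data-frame method for make_splits():

library(rsample)
# Combine predefined training and testing sets into a single split object:
manual_split <- make_splits(train_df, assessment = test_df)
training(manual_split)  # returns train_df
testing(manual_split)   # returns test_df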
Julia, great contribution as always!
A question: would the 0.8/0.2 proportion used for early stopping follow the stratification defined in the data split/CV?
Thanks in advance
@fdeoliveirag No, that is just a random split. You maybe could pass your own validation data (perhaps created via validation_split()) as the xgboost watchlist argument (which would be a boost_tree() engine argument)? I haven't tried that out, I don't think.
If a stratified internal validation set is something you are interested in, you might open an issue on parsnip outlining your use case.
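For context, a minimal sketch of how that random internal holdout is specified via the xgboost engine's validation argument (the values here are just examples):

library(parsnip)
boost_tree(trees = 500, stop_iter = 10) |>
  # Hold out a random 20% of the training data to monitor early stopping:
  set_engine("xgboost", validation = 0.2) |>
  set_mode("classification")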
@juliasilge Thanks for another great screencast. I feel a little bit confused about the concepts of steps and iterations; could you please explain them for me or recommend any material to read about them?
@conlelevn I'm not quite sure what you're asking. Do you mean how early stopping works (in the context of boosting)? I think Wikipedia is nice on this, and it has a little section specifically on early stopping in boosting.
Hi @juliasilge - Trying to use the xgboost engine in tidymodels, how can I get around the date column needing to be in date format when I create my resamples, but then needing to be numeric when I fit the model?

factor_sliding_folds <- rsample::sliding_period(
  train_set |> arrange(date),
  index = date,
  period = "quarter",
  lookback = Inf,
  skip = 4,
  assess_stop = 1,
  complete = FALSE
)
@jlecornu3 I believe you'll want to use some feature engineering like step_date() to build numeric features for xgboost from your date variable.
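A minimal sketch of that approach (train_set and date come from the comment above; the outcome and chosen features are placeholders):

library(recipes)
rec <- recipe(outcome ~ ., data = train_set) |>
  # Derive year, month, and day-of-week features from the date column:
  step_date(date, features = c("year", "month", "dow")) |>
  # month and dow are factors, so convert them to indicators for xgboost:
  step_dummy(all_nominal_predictors()) |>
  # Drop the original date column so only numeric predictors remain:
  step_rm(date)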