
Tune xgboost models with early stopping to predict shelter animal status | Julia Silge

Open utterances-bot opened this issue 3 years ago • 19 comments

Early stopping can keep an xgboost model from overfitting.

https://juliasilge.com/blog/shelter-animals/

utterances-bot avatar Aug 08 '21 02:08 utterances-bot

Hi Julia, thank you for this video, very helpful! I was wondering if there is a way to see which specific arguments are available when declaring a computational engine. I tried to find some info about it on the tidymodels website but couldn't find anything. Thank you!

luisdominguezromero avatar Aug 08 '21 02:08 luisdominguezromero

Hello Dr. Silge, thanks for the analysis. Are you by chance using the dev version of parsnip? I keep getting an error “could not find function stop_iter” even when running your code as is.

Thanks.

gunnergalactico avatar Aug 08 '21 02:08 gunnergalactico

@luisdominguezromero We recently revamped the parsnip documentation to try to surface this information better. For example, take a look at the main landing page for boost_tree(), which has links for the different engines.
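As a minimal sketch of where to look (assuming a recent version of parsnip, which has per-engine help topics):

```r
library(parsnip)

# List the engines (and modes) available for boosted trees
show_engines("boost_tree")

# Engine-specific arguments are documented on per-engine help pages,
# e.g. for the xgboost engine:
?details_boost_tree_xgboost
```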

juliasilge avatar Aug 08 '21 03:08 juliasilge

@gunnergalactico Ah, you don't need anything but CRAN parsnip, but you do need the GitHub version of dials for the stop_iter() parameter. Sorry about that!! I have got to start adding session info to my blog posts. 🙈
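At the time of this post, that meant installing dials from GitHub, something like:

```r
# install.packages("remotes")  # if not already installed
remotes::install_github("tidymodels/dials")
```

(In later CRAN releases of dials, stop_iter() is included and this step is no longer needed.)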

juliasilge avatar Aug 08 '21 03:08 juliasilge

Hi Julia, if I want to predict on a test dataset (from Kaggle) after last_fit(), how is stopping_fit used? Is it stopping_fit that should be saved to ".rds"? Thank you very much.

AleLustosa avatar Aug 08 '21 15:08 AleLustosa

@AleLustosa The object you would want to use for predicting on new data is extract_workflow(stopping_fit) (that is a fitted workflow), so you could store that as something like stopping_fitted_wf and then save to .rds.
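A rough sketch of that workflow (here `stopping_fit` is the last_fit() result from the post, and `kaggle_test` is a placeholder for your new data):

```r
library(tune)  # extract_workflow()

# Pull the fitted workflow out of the last_fit() result and save it
stopping_fitted_wf <- extract_workflow(stopping_fit)
saveRDS(stopping_fitted_wf, "stopping_fitted_wf.rds")

# Later: reload and predict on new data, e.g. the Kaggle test set
fitted_wf <- readRDS("stopping_fitted_wf.rds")
predict(fitted_wf, new_data = kaggle_test)
```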

juliasilge avatar Aug 08 '21 18:08 juliasilge

Hi Julia, thanks for your tutorials, they are very helpful. I want to add step_holiday() to a recipe (for Christmas) and then add step_lag() based on that holiday predictor. From my retail experience, people usually organize presents a week or two before the holiday. How would you do this with the recipes package (basically, lag the Christmas variable)?

I could do this with a normal data transformation, but I wonder whether it's possible to manipulate variables that were created during the recipe (in the step pipeline).

Thanks in advance, sewe

SewerynGrodny avatar Sep 15 '21 11:09 SewerynGrodny

I believe you should be able to use step_lag() with any new variables you create from step_holiday(). If you end up having trouble, I recommend that you create a reprex (a minimal reproducible example) showing what problems you run into. The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it. A good place to ask questions like that is RStudio Community.
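A minimal sketch of chaining the two steps, assuming a hypothetical data frame `retail_train` with a `date` column and outcome `sales` (step_holiday() creates an indicator column named `date_ChristmasDay` by default, which step_lag() can then reference):

```r
library(recipes)

holiday_rec <- recipe(sales ~ ., data = retail_train) %>%
  # create a 0/1 indicator for Christmas Day from the date column
  step_holiday(date, holidays = "ChristmasDay") %>%
  # lag the newly created indicator by one and two weeks
  step_lag(date_ChristmasDay, lag = c(7, 14))
```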

juliasilge avatar Sep 15 '21 20:09 juliasilge

Hi Julia, Thanks for sharing the tutorial. Could you explain why you chose the best parameters based on "mn_log_loss", but evaluated the model performance in terms of "accuracy" and "roc_auc"?

youngjin-lee avatar Sep 19 '21 21:09 youngjin-lee

@youngjin-lee No particular reason; you can pass in a custom metric set to last_fit() with the metrics argument to set which metrics to use for the testing set.
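For example, something like this (with `final_wf` and `shelter_split` as placeholder names for the finalized workflow and the initial split):

```r
library(tune)
library(yardstick)

# Choose which metrics last_fit() computes on the test set
last_fit(
  final_wf,
  shelter_split,
  metrics = metric_set(mn_log_loss, accuracy, roc_auc)
)
```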

juliasilge avatar Sep 20 '21 15:09 juliasilge

Hi Julia,

A few questions for you:

-Is it possible to plot the tree itself with tidymodels?

-I'm trying to use the vip package to get the variable importance scores, but running into this error with the vi function:

Error in eval(stats::getCall(object)$data) : object 'x' not found

However, the plot itself functions just fine. Have you run into this at all?

-Due to a peculiar circumstance, I don't need to split my data into training and testing. Do you have any advice on how to train the model without splitting?

Thanks so much for all you contribute to the R community! Tidymodels and your tutorials have been a huge help for me!

eryn-carleton avatar Oct 01 '21 23:10 eryn-carleton

  • An xgboost model is a boosted tree model so it doesn't really make sense to plot "the tree" (there isn't a single tree). If you train a single decision tree, then you can plot it like this or like this.

  • I haven't had that problem but if you can create a reprex (a minimal reproducible example) for this, you can share it somewhere like RStudio Community and get help.

  • For a powerful model like xgboost, you pretty much always need separate training and testing sets. If what you mean is that you already have defined training and testing sets (you don't need to split) then you can manually create a split using the development version of rsample.
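Regarding the last point, a sketch of building a split by hand from pre-defined sets (this uses make_splits(), which at the time of that comment was in the development version of rsample and can take two data frames in more recent releases; `my_train` and `my_test` are placeholders):

```r
library(rsample)

# Construct an rsplit object from already-separated data frames
my_split <- make_splits(x = my_train, assessment = my_test)

training(my_split)  # returns my_train
testing(my_split)   # returns my_test
```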

juliasilge avatar Oct 02 '21 02:10 juliasilge

Julia, great contribution as always!

A question: would the proportion 0.8/0.2 used in the early stopping follow the stratification defined in data split/cv?

Thanks in advance

fdeoliveirag avatar Jun 14 '22 21:06 fdeoliveirag

@fdeoliveirag No, that is just a random split. You maybe could pass your own validation data (perhaps created via validation_split()) as the xgboost watchlist argument (which would be a boost_tree() engine argument)? I haven't tried that out, I don't think.

If a stratified internal validation set is something you are interested in, you might open an issue on parsnip outlining your use case.

juliasilge avatar Jun 14 '22 22:06 juliasilge

@juliasilge Thanks for another great screencast. I feel a little confused about the difference between the concepts of steps and iterations; could you explain it, or recommend any material to read about it?

conlelevn avatar Jul 11 '22 03:07 conlelevn

@conlelevn I'm not quite sure what you're asking. Do you mean how early stopping works (in the context of boosting)? I think Wikipedia is nice on this, and it has a little section specifically on early stopping in boosting.

juliasilge avatar Jul 11 '22 14:07 juliasilge

Hi @juliasilge - Trying to use the xgboost engine in tidymodels, how can I get around the date column needing to be in date format when I create the expanding/sliding window validation folds, but then needing to be numeric when I come to the xgboost fit?

factor_sliding_folds <- rsample::sliding_period(
  train_set |> arrange(date),
  index = date,
  period = "quarter",
  lookback = Inf,
  skip = 4,
  assess_stop = 1,
  complete = FALSE
)

jlecornu3 avatar Jan 22 '24 08:01 jlecornu3

Earlier comment should say: the date column needs to be in date format when I create the folds, but numeric when I come to the xgboost fit.

jlecornu3 avatar Jan 22 '24 10:01 jlecornu3

@jlecornu3 I believe you'll want to use some feature engineering like step_date() to build numeric features for xgboost from your date variable.
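A minimal sketch of that kind of recipe (assuming a training set `train_set` with a `date` column and a placeholder outcome `y`; the date stays a date for sliding_period(), and the recipe converts it to numeric features at fit time):

```r
library(recipes)

date_rec <- recipe(y ~ ., data = train_set) %>%
  # derive year/month/day-of-week features from the date column
  step_date(date, features = c("year", "month", "dow")) %>%
  # turn the factor features (month, dow) into indicators for xgboost
  step_dummy(all_nominal_predictors()) %>%
  # drop the original date column, which xgboost can't use directly
  step_rm(date)
```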

juliasilge avatar Jan 22 '24 16:01 juliasilge