
Early Stopping in EBM

Open onacrame opened this issue 3 years ago • 5 comments

Correct me if I'm wrong, but the native early stopping mechanism within EBM will just take a random slice of the data. In the case of (i) grouped observations (panel data, where one ID might relate to multiple rows of data) or (ii) imbalanced data, where one might want to ensure stratification, a random cut may not be optimal. Is there any way to use an iterator to predefine which slice of the data is used for early stopping?
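
For illustration, the kind of group-aware holdout described in (i) can be predefined with scikit-learn's `GroupShuffleSplit`, so that no ID straddles the train/validation boundary (data names here are made up for the sketch):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Panel data: each ID (group) spans multiple rows.
rng = np.random.default_rng(0)
n_rows = 100
groups = rng.integers(0, 20, size=n_rows)   # 20 IDs, each repeated across rows
X = rng.normal(size=(n_rows, 3))
y = rng.integers(0, 2, size=n_rows)

# Hold out ~25% of the *groups*, not of the rows, so every ID lands
# entirely on one side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(X, y, groups=groups))

# No group appears on both sides of the split.
assert set(groups[train_idx]).isdisjoint(groups[val_idx])
```

The resulting `val_idx` is exactly the kind of predefined validation slice the question is asking to pass in.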

onacrame avatar Jul 28 '21 15:07 onacrame

Hi @onacrame,

Great point -- our default validation sampling does stratify across the label, but unfortunately does not customize beyond that. Adding support for custom validation sets (which are only used for early stopping) is on our backlog, but has not been implemented yet.

An iterator is an interesting idea. We were also thinking about supplementing the fit call to take in a user defined validation_set = (X_val, y_val) as another option (which we would then sample from for each bag of data). Would be interested to hear your thoughts on different options for defining this!

-InterpretML Team

interpret-ml avatar Jul 29 '21 22:07 interpret-ml

Defining the validation set would be a great option, as one could use whatever sklearn-style iterators one wants while keeping the Interpret-ML API simpler. So the default option would be as it is now, but with the ability to pass in a user-defined validation set.

onacrame avatar Jul 30 '21 07:07 onacrame

@interpret-ml

In catboost (https://catboost.ai/docs/concepts/python-reference_catboostclassifier_fit.html#python-reference_catboostclassifier_fit), xgboost (https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) and lightgbm (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html) you have an eval_set parameter for the fit() method, which you can use to provide "A list of (X, y) tuple pairs to use as validation sets, for which metrics will be computed".

candalfigomoro avatar Aug 04 '21 17:08 candalfigomoro

Another ancillary point is that, once a model building process is finished, it's customary to retrain the final model on all the data, using whatever early stopping thresholds were found during cross validation or while running against a validation set. The EBM framework doesn't really allow for this, given that there's always a holdout set and no way to "refit" the model without a validation set, so there will always be some portion of the data that cannot be used in the final model.

Just an observation.

onacrame avatar Aug 05 '21 07:08 onacrame

Another problem is that if, for example, you oversampled a class in the training set, you should not have an oversampled validation set (the validation set distribution should be similar to the test set distribution and to the live data distribution). If you split the validation set from the training set, you inherit the oversampled training set distribution. This is also true if you perform data augmentation on the training set. Splitting the validation set from the training set is often a bad idea.
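
The distribution problem described above is easy to demonstrate numerically. This sketch (synthetic data, naive duplication-based oversampling) shows how a validation set carved out of an oversampled training pool no longer matches the live base rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Original data: roughly 10% positive class.
y = (rng.random(10_000) < 0.10).astype(int)

# Naive oversampling: duplicate positives until the classes are balanced.
pos = np.where(y == 1)[0]
extra = rng.choice(pos, size=(y == 0).sum() - pos.size, replace=True)
y_over = np.concatenate([y, y[extra]])

# A validation set split off the oversampled pool inherits ~50% positives,
# nothing like the ~10% base rate of the original (and live) data.
_, y_val = train_test_split(y_over, test_size=0.2, random_state=0)
print(round(float(y_val.mean()), 2))
```

Any early-stopping metric computed on such a validation set is optimized for the wrong class balance, which is the core of the objection.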

candalfigomoro avatar Aug 17 '21 14:08 candalfigomoro

Any timeline as to how soon this feature will be incorporated? This is extremely crucial, especially for problems where you can't randomly split the data.

sarim-zafar avatar Feb 20 '23 18:02 sarim-zafar

This can now be accomplished with the bags parameter. Details in our docs: https://interpret.ml/docs/ebm.html#interpret.glassbox.ExplainableBoostingClassifier.fit
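
As a sketch of how that might look (based on my reading of the linked docs, so treat the exact semantics as an assumption to verify there): `bags` is an array of shape `(n_outer_bags, n_samples)` where, per bag, positive entries place a sample in the training set, negative entries place it in the validation (early-stopping) set, and 0 excludes it. A single bag with a hand-picked holdout could then be built like this:

```python
import numpy as np

# Hypothetical setup: pin the early-stopping holdout to a predefined mask,
# e.g. the last IDs of a panel, instead of a random slice.
n_samples = 8
val_mask = np.zeros(n_samples, dtype=bool)
val_mask[[5, 6, 7]] = True

# One outer bag: -1 = validation sample, 1 = training sample
# (0 would exclude a sample entirely, per the docs linked above).
bags = np.where(val_mask, -1, 1)[np.newaxis, :].astype(np.int8)

# ebm = ExplainableBoostingClassifier(outer_bags=1)
# ebm.fit(X, y, bags=bags)   # fit call commented out; illustrative only
```

With `outer_bags=1` and this array, early stopping would be evaluated only on the rows flagged by `val_mask`.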

paulbkoch avatar Aug 11 '23 18:08 paulbkoch