(scikit-learn api) (feature request) XGBRegressor with early stopping, dynamic eval_set
`XGBRegressor` has the parameter `eval_set`, where you pass an evaluation set that the regressor uses to perform early stopping. Since this `eval_set` is fixed, when you do cross-validation with n folds, the same `eval_set` is used in all n folds.

It would be nice to have the option for `eval_set` to be created dynamically at `fit` time, as a split from that particular fold. Find an example below.

If you like the idea, I'm happy to do a proper PR.
```python
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor


class XGBRegressorWithEarlyStop(XGBRegressor):
    """Wrapper of XGBRegressor with early stopping."""

    def __init__(self, objective="reg:squarederror", early_stopping_rounds=5,
                 test_size=0.1, eval_metric="rmse", shuffle=False, **kwargs):
        """Init as super."""
        self.early_stopping_rounds = early_stopping_rounds
        self.test_size = test_size
        self.eval_metric = eval_metric
        self.shuffle = shuffle
        super().__init__(objective=objective, **kwargs)

    def fit(self, x, y, verbose=False, sample_weight=None):
        """Fit the regressor, holding out a validation split for early stopping."""
        if sample_weight is not None:
            x_train, x_val, y_train, y_val, w_train, w_val = train_test_split(
                x, y, sample_weight,
                test_size=self.test_size, shuffle=self.shuffle)
        else:
            x_train, x_val, y_train, y_val = train_test_split(
                x, y,
                test_size=self.test_size, shuffle=self.shuffle)
            w_train, w_val = None, None
        super().fit(x_train, y_train,
                    early_stopping_rounds=self.early_stopping_rounds,
                    eval_metric=self.eval_metric,
                    eval_set=[(x_val, y_val)],
                    verbose=verbose,
                    sample_weight=w_train,
                    sample_weight_eval_set=(
                        [w_val] if w_val is not None else None))
        return self
```
Thank you for the offer. I understand that sklearn does early stopping this way, but we have to consider optional information other than `sample_weight`, as well as GPU data structures and distributed training. It might not be a flexible design to embed data partitioning into our code.
Having said that, I would welcome any discussion around the feature and design.
Assuming we're talking about `XGBRegressor` (`XGBClassifier` is equivalent), I see 3 options.
Option 1 - new class
Define a new class `XGBRegressorWithEarlyStop` as above. This adds the feature without altering the native class, but then you have two very similar classes, more documentation to maintain, etc.
Option 2 - implicit params
We don't add any new param to `XGBRegressor`. But now if we call `XGBRegressor(..., early_stopping_rounds=5, eval_set=None)`, instead of raising the error `eval_set is not defined`, the class creates the `eval_set` dynamically at `fit` time.
Option 3 - explicit params

We add a new param to `XGBRegressor`:

`eval_set_dynamic: bool = True` - `eval_set` is created dynamically.

If we call `XGBRegressor(..., early_stopping_rounds=5, eval_set=None, eval_set_dynamic=False)`, then it still raises the error `eval_set is not defined`.

If we call `XGBRegressor(..., early_stopping_rounds=5, eval_set=[(x_val, y_val)], eval_set_dynamic=True)`, then it raises the error `you can not pass static eval_set when eval_set_dynamic is True`.
Option 2 seems to be reasonable:

```python
def fit(...):
    if self.early_stopping_rounds is not None and eval_set is None:
        train_X, valid_X, train_y, valid_y = train_test_split(...)
    else:
        ...
```
I think we would like to keep the `eval_set` in `fit`, as it's a data-dependent parameter and should be specified under the `fit` method according to the sklearn estimator guideline.
The next issue is parameters other than `sample_weight`: we also have `base_margin` for all estimators. Also, learning-to-rank and survival training are coming to the sklearn interface, each with its own way of specifying the data. Specializing over each of them would complicate the code significantly.

Lastly, as mentioned in the previous comment, distributed training and GPU input also need to be considered.