(scikit-learn api) (feature request) XGBRegressor with early stopping, dynamic eval_set
`XGBRegressor` has the parameter `eval_set`, where you pass an evaluation set that the regressor uses to perform early stopping. Since this `eval_set` is fixed, when you do cross-validation with n folds, the same `eval_set` is used in all n folds.

It would be nice to have the option for `eval_set` to be created dynamically at `fit` time, as a split from that particular fold. Find an example below.

If you like the idea, I'm happy to do a proper PR.
```python
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor


class XGBRegressorWithEarlyStop(XGBRegressor):
    """Wrapper of XGBRegressor with early stopping."""

    def __init__(self, objective="reg:squarederror", early_stopping_rounds=5,
                 test_size=0.1, eval_metric="rmse", shuffle=False, **kwargs):
        """Init as super."""
        self.early_stopping_rounds = early_stopping_rounds
        self.test_size = test_size
        self.eval_metric = eval_metric
        self.shuffle = shuffle
        super().__init__(objective=objective, **kwargs)

    def fit(self, x, y, verbose=False, sample_weight=None):
        """Fit the regressor, holding out a validation split for early stopping."""
        if sample_weight is not None:
            x_train, x_val, y_train, y_val, w_train, w_val = train_test_split(
                x, y, sample_weight,
                test_size=self.test_size, shuffle=self.shuffle)
        else:
            x_train, x_val, y_train, y_val = train_test_split(
                x, y,
                test_size=self.test_size, shuffle=self.shuffle)
            w_train, w_val = None, None
        super().fit(x_train, y_train,
                    early_stopping_rounds=self.early_stopping_rounds,
                    eval_metric=self.eval_metric,
                    eval_set=[(x_val, y_val)],
                    verbose=verbose,
                    sample_weight=w_train,
                    sample_weight_eval_set=(
                        [w_val] if w_val is not None else None))
        return self
```
Thank you for the offer. I understand that sklearn does early stopping this way, but we have to consider optional information other than `sample_weight`, as well as GPU data structures and distributed training. It might not be a flexible design to embed data partitioning into our code.
Having said that, I would welcome any discussion around the feature and design.
Assuming we're talking about `XGBRegressor` (`XGBClassifier` is equivalent), I see 3 options.
Option 1 - new class
Define a new class `XGBRegressorWithEarlyStop` as above. This adds the feature without altering the native class, but then you have two very similar classes, more documentation to maintain, etc.
Option 2 - implicit params
We don't add any new param to `XGBRegressor`. But now if we call `XGBRegressor(..., early_stopping_rounds=5, eval_set=None)`, instead of raising the error `eval_set is not defined`, the class creates the `eval_set` dynamically at `fit` time.
Option 3 - explicit params

We add a new param to `XGBRegressor`:

`eval_set_dynamic: bool = True` - `eval_set` is created dynamically.

If we call `XGBRegressor(..., early_stopping_rounds=5, eval_set=None, eval_set_dynamic=False)`, then it still raises the error `eval_set is not defined`.

If we call `XGBRegressor(..., early_stopping_rounds=5, eval_set=[(x_val, y_val)], eval_set_dynamic=True)`, then it raises the error `you can not pass static eval_set when eval_set_dynamic is True`.
Option 2 seems to be reasonable:

```python
def fit(...):
    if self.early_stopping_rounds is not None and eval_set is None:
        train_X, valid_X, train_y, valid_y = train_test_split(...)
    else:
        ...
```
I think we would like to keep the `eval_set` in `fit`, as it's a data-dependent parameter and should be specified under the `fit` method according to the sklearn estimator guideline.
The next issue is parameters other than `sample_weight`: we also have `base_margin` for all estimators. Also, learning-to-rank and survival training are coming to the sklearn interface, each with its own way of specifying the data. Specializing over each of them would complicate the code significantly.

Lastly, as mentioned in the previous comment, distributed training and GPU input also need to be considered.