
Early stopping with gblinear doesn't save the best model for subsequent prediction

Open · sktin opened this issue 1 year ago · 1 comment

Code to replicate:

import xgboost as xgb
print(F'{xgb.__version__=}')

from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(50000, random_state=0)
model = XGBClassifier(
    booster='gblinear', updater='coord_descent',
    eval_metric='auc', eta=0.01, 
    early_stopping_rounds=10, n_estimators=1000000, 
    random_state=0, n_jobs=4
)
model.fit(X, y, eval_set=[(X, y)], verbose=1)
print(F'{model.best_iteration=}, {model.best_score=}')
print(roc_auc_score(y, model.predict_proba(X)[:,1]))

Output:

xgb.__version__='2.1.1'
[0]	validation_0-auc:0.93851
[1]	validation_0-auc:0.93851
[2]	validation_0-auc:0.93851
[3]	validation_0-auc:0.93851
[4]	validation_0-auc:0.93851
[5]	validation_0-auc:0.93851
[6]	validation_0-auc:0.93851
[7]	validation_0-auc:0.93851
[8]	validation_0-auc:0.93851
[9]	validation_0-auc:0.93851
[10]	validation_0-auc:0.93851
[11]	validation_0-auc:0.93851
[12]	validation_0-auc:0.93851
[13]	validation_0-auc:0.93851
[14]	validation_0-auc:0.93851
[15]	validation_0-auc:0.93850
[16]	validation_0-auc:0.93850
model.best_iteration=6, model.best_score=0.9385135544606076
0.9385032096436587

iteration_range has no effect.

print(roc_auc_score(y, model.predict_proba(X, iteration_range=(0,7))[:,1]))
print(roc_auc_score(y, model.predict_proba(X, iteration_range=(0,1000000))[:,1]))
print(roc_auc_score(y, model.predict_proba(X, iteration_range=(0,1))[:,1]))

Output:

0.9385032096436587
0.9385032096436587
0.9385032096436587

The only workaround I can think of is re-fitting with the best_iteration found.

model = XGBClassifier(
    booster='gblinear', updater='coord_descent',
    eval_metric='auc', eta=0.01, 
    n_estimators=model.best_iteration+1, 
    random_state=0, n_jobs=4
)
model.fit(X, y, eval_set=[(X, y)], verbose=1)
print(roc_auc_score(y, model.predict_proba(X)[:,1]))

Output:

[0]	validation_0-auc:0.93851
[1]	validation_0-auc:0.93851
[2]	validation_0-auc:0.93851
[3]	validation_0-auc:0.93851
[4]	validation_0-auc:0.93851
[5]	validation_0-auc:0.93851
[6]	validation_0-auc:0.93851
0.9385135544606077

sktin, Oct 15 '24 15:10

  • With booster='gblinear', early stopping ends training, but the model kept for prediction is the one from the last round rather than the best one. Unlike gbtree, gblinear has no sequence of trees that can be truncated afterwards.

  • gbtree builds a sequence of trees, each improving on the last, so prediction can be limited to the trees up to the best iteration. gblinear instead updates a single set of weights in place at every round. Because the weights from earlier rounds are not stored, early stopping halts training but cannot revert to the best iteration, which would also explain why iteration_range has no effect.

  • To work around this, we can track the best score in a custom callback and save the model to a file whenever the score improves:

      import xgboost as xgb
      from sklearn.datasets import make_classification
      from sklearn.metrics import roc_auc_score
    
    
      class SaveBestModelCallback(xgb.callback.TrainingCallback):
          def __init__(self, save_path='best_model.json'):
              self.best_score = None
              self.best_iteration = None
              self.save_path = save_path
    
          def after_iteration(self, model, epoch, evals_log):
              # Monitor the AUC score from the evals_log
              score = evals_log['valid']['auc'][-1]
              if self.best_score is None or score > self.best_score:
                  self.best_score = score
                  self.best_iteration = epoch
                  model.save_model(self.save_path)  # Save the model to a file
              return False  # Continue training
    
    
      X, y = make_classification(50000, n_features=20, random_state=0)
    
    
      dtrain = xgb.DMatrix(X, label=y)
    
    
      params = {
          'booster': 'gblinear',
          'updater': 'coord_descent',
          'eval_metric': 'auc',
          'eta': 0.01,
          'objective': 'binary:logistic',
          'n_jobs': 4,
          'random_state': 0
      }
    
      # Initialize the callback
      save_best_model_cb = SaveBestModelCallback()
    
      bst = xgb.train(
          params=params,
          dtrain=dtrain,
          num_boost_round=1000000,
          evals=[(dtrain, 'valid')],
          callbacks=[save_best_model_cb],
          early_stopping_rounds=10
      )
    
      # Load the best model
      best_model = xgb.Booster()
      best_model.load_model(save_best_model_cb.save_path)
    
      # Make predictions with the best model
      predictions = best_model.predict(dtrain)
      auc_score = roc_auc_score(y, predictions)
    
      print(f"Best Iteration: {save_best_model_cb.best_iteration}")
      print(f"Best AUC Score: {save_best_model_cb.best_score}")
      print(f"Final AUC Score: {auc_score}")
    
    

Output:
[0]     valid-auc:0.93851
[1]     valid-auc:0.93851
[2]     valid-auc:0.93851
[3]     valid-auc:0.93851
[4]     valid-auc:0.93851
[5]     valid-auc:0.93851
[6]     valid-auc:0.93851
[7]     valid-auc:0.93851
[8]     valid-auc:0.93851
[9]     valid-auc:0.93851
[10]    valid-auc:0.93851
[11]    valid-auc:0.93851
[12]    valid-auc:0.93851
[13]    valid-auc:0.93851
[14]    valid-auc:0.93851
[15]    valid-auc:0.93850
[16]    valid-auc:0.93850
Best Iteration: 6
Best AUC Score: 0.9385135544606076 
Final AUC Score: 0.9385135544606077
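
The difference between the two boosters can be illustrated with a toy sketch in plain Python. This is not XGBoost's implementation: `score`, `fit`, the step size, and the patience of 3 are all hypothetical, chosen only to show why an additive model can be sliced back to its best iteration while a weight that is overwritten in place needs an explicit snapshot.

```python
def score(weight):
    """Toy validation metric (higher is better); peaks at weight == 5."""
    return -(weight - 5) ** 2


def fit(patience=3, max_rounds=100):
    """Train a toy additive model and a toy in-place model side by side."""
    stages = []   # additive model: one entry is kept per round (gbtree-like)
    weight = 0    # in-place model: a single value overwritten each round (gblinear-like)
    best_score = None
    best_iteration = None
    best_weight = None

    for epoch in range(max_rounds):
        step = 1  # hypothetical fixed update per round
        stages.append(step)
        weight += step

        current = score(weight)
        if best_score is None or current > best_score:
            best_score = current
            best_iteration = epoch
            best_weight = weight  # snapshot, analogous to saving the model in the callback
        elif epoch - best_iteration >= patience:
            break  # early stopping: no improvement for `patience` rounds

    return stages, weight, best_iteration, best_weight


stages, final_weight, best_iteration, best_weight = fit()

# The additive model can be cut back to the best iteration after training.
print(sum(stages[: best_iteration + 1]))
# The in-place model only has its last state unless a snapshot was taken.
print(final_weight, best_weight)
```

In this sketch, training overshoots the best round by `patience` rounds before stopping, just as the logs above run to round 16 with a best iteration of 6. Slicing `stages` recovers the best additive model, whereas the in-place weight matches the best score only through `best_weight`.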

archanajagtap23, Mar 18 '25 05:03