
boosters that should be training identically are not

bitfilms opened this issue 4 years ago · 6 comments

Hello -

We've found an issue where models that should end up identical after training (assuming a deterministic random number generator) do not always come out identical. See the following code, in which we train 4 boosters with identical parameter sets on the same data for 3 rounds each.

Two are trained for 3 rounds each in series, i.e., we train booster 1 for three rounds before moving on to booster 2. Their predictions are identical and always appear to come out that way (as expected, since both start from the same seed, even though the colsample_* parameters introduce random column sampling).

However, the other two are trained in an "interleaved" fashion, i.e., we train one booster for one round, then train the second booster for one round, and repeat until both have been trained for 3 rounds. Those boosters' predictions match neither the first two boosters nor each other, even though they too were trained with the same seed, params, and training data.

It appears that having any one of the colsample_* parameters set below 1.0 is enough to trigger this effect. We suspect that a random number generator held at the module level, rather than per booster, is preserving its state between calls to xgb.train(), which puts the training out of sync when the calls alternate between different boosters.
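To illustrate the mechanism we suspect (this is only an analogy using NumPy's module-level generator, not XGBoost internals), interleaving two consumers of a single shared RNG changes the draws each consumer sees:

import numpy as np

# Consumer A draws three values from the global generator on its own.
np.random.seed(4512)
a = [np.random.rand() for _ in range(3)]

# Consumers B and C draw from the same global generator, interleaved.
np.random.seed(4512)
b, c = [], []
for _ in range(3):
    b.append(np.random.rand())
    c.append(np.random.rand())

# Only B's first draw matches A; after that the shared state has advanced for
# both consumers, so B and C each see a different sequence than A did.
print(a[0] == b[0], a[1] == b[1], a[1] == c[0])  # True False True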

If you uncomment the two lines in our code featuring the variable clear_the_pipes and re-run, you will see that calling xgb.train() with xgb_model=None does seem to reset things for the interleaved boosters. In fact, such a reset is the only explanation for why the first two models train identically: if the random state were not reset after the first 3 calls to xgb.train(), the second series-trained model would come out differently.

We feel that even with colsample_* params affecting the training outcome, the results should be repeatable, and all four of our boosters should end up the same. Perhaps the random state should be kept on a per-booster basis so that every call to xgb.train() sets the seed accordingly.
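In the meantime, a possible workaround (only a sketch we have not verified, and train_one_round is a helper name we made up for illustration) is to derive a distinct seed for each xgb.train() call from the booster index and the boosting round, so a given booster sees the same seed sequence whether it is trained in series or interleaved:

def train_one_round(params, dtrain, booster, booster_idx, round_idx, base_seed=4512):
    # Give every single-round call its own deterministic seed so that shared
    # RNG state cannot drift between boosters.
    per_call_params = dict(params)
    per_call_params['seed'] = base_seed + 1000 * booster_idx + round_idx
    return xgb.train(per_call_params, dtrain, num_boost_round=1, xgb_model=booster)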

Thanks in advance for your help with this one!

import xgboost as xgb
import pandas as pd
import numpy as np

# Random training data: 100 rows x 10 features, plus a random target column.
train_df = pd.DataFrame( np.random.rand( 100, 10))
label_df = pd.DataFrame( np.random.rand( 100, 1))

trainDM  = xgb.DMatrix( data=train_df, label=label_df )


# Any one colsample_* value below 1.0 is enough to trigger the divergence.
param_dict = {
 'colsample_bytree': .25,
 'colsample_bylevel': 1.0,
 'colsample_bynode': 1.0,
 'seed': 4512}

boosters_trained_in_series = [ None, None ]
boosters_trained_interleaved = [ None, None ]

# clear_the_pipes = None

# Series: train booster 0 for 3 consecutive rounds, then booster 1 for 3 rounds.
for i in range( 0, 2):
    for a in range( 0, 3):
        boosters_trained_in_series[ i] = xgb.train( param_dict, trainDM, num_boost_round=1, xgb_model=boosters_trained_in_series[ i])

# Interleaved: alternate one round per booster until each has had 3 rounds.
for a in range(0, 3):
    for i in range( 0, 2):
        boosters_trained_interleaved[ i] = xgb.train( param_dict, trainDM, num_boost_round=1, xgb_model=boosters_trained_interleaved[ i])
        # clear_the_pipes = xgb.train(param_dict, trainDM, num_boost_round=1, xgb_model=None )

predictions_df = pd.DataFrame()

all_boosters = boosters_trained_in_series + boosters_trained_interleaved

for i in range( len( all_boosters)):
    predictions_df[ i] = all_boosters[ i].predict( trainDM)

predictions_df
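For convenience (this check is not part of the original repro, just something we added to make the mismatch explicit), the prediction columns can be compared pairwise; in our runs only the two series-trained boosters agree:

for i in range(len(all_boosters)):
    for j in range(i + 1, len(all_boosters)):
        # np.allclose tolerates tiny float differences; exact equality also works here.
        same = np.allclose(predictions_df[i], predictions_df[j])
        print(f"booster {i} vs booster {j}: {'identical' if same else 'different'}")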

This is all in Python 3.7.9, xgboost 1.3.3.

bitfilms · Feb 17 '21