Impossible to reproduce model results

Open sergiocalde94 opened this issue 6 years ago • 49 comments

I've just opened this issue in the dask repo, but maybe it's better placed here...

I'm using dask to implement a data pipeline with dask dataframes and dask-ml on a YARN cluster.

When I build an XGBoost model, the results are always different, even if I manually fix a seed with da.random.seed().

import dask_xgboost as dxgb


params = {'objective': 'binary:logistic', 'n_estimators': 420,
          'max_depth': 5, 'eta': .01,
          'subsample': .8, 'colsample_bytree': .8,
          'learning_rate': .05, 'scale_pos_weight': 1}

bst = dxgb.train(client, params, fitted.transform(X), y)

Is it possible to reproduce the results of a dask model, the way I can locally using sklearn instead of dask-ml?

sergiocalde94 avatar Apr 12 '19 10:04 sergiocalde94

The problem appears when I run the model in cluster mode (not local). It's a YARN cluster, as I mentioned before.

sergiocalde94 avatar Apr 12 '19 11:04 sergiocalde94

When I build an XGBoost model, the results are always different, even if I manually fix a seed with da.random.seed().

da.random.seed has no effect on dask-xgboost, so that definitely won't work. Currently it looks like we don't support setting a random seed in this library, but we should be able to. I'm not super familiar with xgboost, but it looks like you should be able to set the seed by adding seed to the params, which will be forwarded to every call to xgb.train (this may be non-optimal though; we may want a different seed per task).

You may try this and see if things work (untested).

import dask_xgboost as dxgb

params = {'objective': 'binary:logistic', 'n_estimators': 420,
          'max_depth': 5, 'eta': .01,
          'subsample': .8, 'colsample_bytree': .8,
          'learning_rate': .05, 'scale_pos_weight': 1, 'seed': 1234}

bst = dxgb.train(client, params, fitted.transform(X), y)

Provided X and y are consistently partitioned, and seed can be passed this way, I would expect consistent results. XGBoost also has some non-determinism inherent to it (see https://xgboost.readthedocs.io/en/latest/faq.html#slightly-different-result-between-runs). @TomAugspurger would know more, but he may be busy at the moment.
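
For instance, a rough (untested) sketch of pinning the input layout down before training, assuming fitted.transform(X) gives back a dask DataFrame (the names X_t/y_t and the partition count are just illustrative):

import dask_xgboost as dxgb

# Fix the partition count and persist, so the partition layout can't
# change between runs of train.
X_t = fitted.transform(X).repartition(npartitions=8).persist()
y_t = y.repartition(npartitions=8).persist()

bst = dxgb.train(client, params, X_t, y_t)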

jcrist avatar Apr 12 '19 15:04 jcrist

Sorry, I should have told you that I also tested with the seed parameter, and it's still not reproducible :(

sergiocalde94 avatar Apr 12 '19 15:04 sergiocalde94

Ok. This may have to do with how we're using xgboost, or it may be inherent to xgboost (as I mentioned above). I'm not the person to figure this out; Tom likely knows more here.

jcrist avatar Apr 12 '19 15:04 jcrist

@TomAugspurger can you reply please? :(

sergiocalde94 avatar Apr 23 '19 10:04 sergiocalde94

I’m on parental leave for the next couple weeks. Could you try debugging it further yourself?

TomAugspurger avatar Apr 23 '19 11:04 TomAugspurger

When you say it's not replicable, do you mean the model itself, or its prediction? If it's the prediction, is it the probability or the class label?

One thing to note: if you don't specify tree_method, the xgboost backend automatically picks approx as its tree method. Maybe you can fix it to exact in the params and try again?

DigitalPig avatar Apr 24 '19 12:04 DigitalPig

Sorry @TomAugspurger ;(

Hi @DigitalPig,

The point is that if I build two xgboost models with exactly the same parameters, I don't get the same model back, because the importances are different. My preprocessing code is this (df_train is a dask dataframe):

import dask.array as da

from sklearn.pipeline import Pipeline
from dask_ml.compose import ColumnTransformer
from dask_ml.impute import SimpleImputer
from dask_ml.preprocessing import Categorizer, OneHotEncoder


FILL_MISSING_NUMERICAL = -99
FILL_MISSING_CATEGORICAL = 'Desconocido'


da.random.seed(42)
columns_numeric = df_train.select_dtypes(include='number').columns
columns_categorical = df_train.select_dtypes(exclude='number').columns
columns_categorical = columns_categorical[columns_categorical != 'variable_350']

# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=FILL_MISSING_NUMERICAL))])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value=FILL_MISSING_CATEGORICAL)),
    ('categorizer', Categorizer()),
    ('onehot', OneHotEncoder(sparse=False))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, columns_numeric),
        ('cat', categorical_transformer, columns_categorical)])

preprocessing_pipeline = Pipeline(steps=[('preprocessor', preprocessor)])

X = df_train.drop('variable_350', axis=1)
y = df_train['variable_350'].astype(int)

fitted = preprocessing_pipeline.fit(X, y)

and then if I run this training twice and plot the feature importances, they are different:

params = {'objective': 'binary:logistic', 'n_estimators': 420,
          'max_depth': 5, 'eta': .01,
          'subsample': .8, 'colsample_bytree': .8,
          'learning_rate': .05, 'scale_pos_weight': 1,
          'tree_method': 'exact', 'seed': 123}

bst = dxgb.train(client, params, fitted.transform(X), y)

import xgboost as xgb
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))

ax = xgb.plot_importance(bst, ax=ax, height=0.8, max_num_features=20)
ax.grid(True, axis="y")

first model:

[feature importance plot of the first model]

second model:

[feature importance plot of the second model]

As you can see, the results are slightly different. Maybe I'm doing something wrong...

Thanks for your replies

sergiocalde94 avatar Apr 24 '19 16:04 sergiocalde94

Which xgboost version are you using? I know recent xgboost releases changed the default variable importance metric from weight to gain. The plot you show still uses weight here.
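
As a quick check (just a sketch; bst here stands for your trained booster), you could compare the two metrics directly:

import xgboost as xgb

# 'weight' counts how often a feature is split on; 'gain' averages the
# loss reduction from those splits, so the two rankings can differ.
print(bst.get_score(importance_type='weight'))
print(bst.get_score(importance_type='gain'))

ax = xgb.plot_importance(bst, importance_type='gain', max_num_features=20)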

Also, I would try to take a downsampled dataset and train it w/o dask to see if you still get different variable importance.

Lastly, there are some stochastic options turned on during your training, like colsample_bytree. In theory, if you fix the seed (and the seed gets passed everywhere) it shouldn't matter, but I would also try turning them off to see if you still have the same issue.

What about the prediction of these two models?

DigitalPig avatar Apr 29 '19 03:04 DigitalPig

@DigitalPig sorry for taking so long to answer; I was on holiday.

I'm using xgboost 0.81, as returned by:

import xgboost as xgb


print(xgb.__version__)

With the random options turned off, the model also returns different importances.

I ran this twice:

params = {'objective': 'binary:logistic', 'n_estimators': 420,
          'max_depth': 5, 'eta': .01,
          'subsample': 1, 'colsample_bytree': 1,
          'learning_rate': .05, 'scale_pos_weight': 1,
          'tree_method': 'exact', 'seed': 123}

bst = dxgb.train(client, params, fitted.transform(X), y)

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 8))

ax = xgb.plot_importance(bst, ax=ax, height=0.8, max_num_features=20)
ax.grid(True, axis="y")

The first execution returns these importances:

[feature importance plot of the first run]

And the second time:

[feature importance plot of the second run]

BUT when I ran the test on less data (a subset of only 100,000 records), the models returned the same importances, even with the stochastic parameters set to values below 1 (e.g. subsample .8 or colsample_bytree .8).

So maybe it's because of the size of the data??

sergiocalde94 avatar May 06 '19 13:05 sergiocalde94

Any idea why dask-xgboost doesn't return reproducible results with more data?

sergiocalde94 avatar May 22 '19 09:05 sergiocalde94

If you remove dask-xgboost from the equation and just use XGBoost, are the results deterministic? Is it still deterministic if you use XGBoost distributed training (again, not using dask to set up the distributed xgboost runtime)?
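
Something like this minimal sketch would check it on a single machine (synthetic placeholder data, not your pipeline; adjust params to match yours):

import numpy as np
import xgboost as xgb

# Synthetic stand-in for the real data.
rng = np.random.RandomState(0)
X_local = rng.rand(100000, 20)
y_local = rng.randint(0, 2, size=100000)
dtrain = xgb.DMatrix(X_local, label=y_local)

params = {'objective': 'binary:logistic', 'max_depth': 5,
          'learning_rate': .05, 'tree_method': 'exact', 'seed': 123}

bst1 = xgb.train(params, dtrain, num_boost_round=420)
bst2 = xgb.train(params, dtrain, num_boost_round=420)

# Identical tree dumps mean the two boosters are exactly the same model.
print(bst1.get_dump() == bst2.get_dump())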

TomAugspurger avatar May 22 '19 19:05 TomAugspurger

@TomAugspurger yes! Just using XGBoost, the results are deterministic with all of the data, running xgboost with n_jobs=30 (cores).

sergiocalde94 avatar May 27 '19 22:05 sergiocalde94

Is that 30 cores on one machine or distributed?

TomAugspurger avatar May 27 '19 22:05 TomAugspurger

@TomAugspurger hmm, it's on one machine. Can I run xgboost distributed with only the xgboost library?

sergiocalde94 avatar May 27 '19 22:05 sergiocalde94

Yes, that’s the runtime dask hooks into.

TomAugspurger avatar May 27 '19 23:05 TomAugspurger

Ok I will test it tomorrow! Thanks

sergiocalde94 avatar May 27 '19 23:05 sergiocalde94

Sorry, but I couldn't test it: in our environment we are using a cluster that we are not allowed to configure, and it's not possible to run xgboost distributed without dask.

Any idea to test it? :(

PS: For me the strangest thing is that with less data the results are reproducible, even though dask is also using the cluster (the dask dashboard shows it).

sergiocalde94 avatar Jun 03 '19 17:06 sergiocalde94

I don't have any other ideas at the moment.

TomAugspurger avatar Jun 03 '19 20:06 TomAugspurger

@sergiocalde94 Do you have minimal example that reproduces this issue? If so, I can take a look.

mmccarty avatar Oct 16 '19 15:10 mmccarty

I was able to reproduce this error. I'm taking a look at why it's happening.

kylejn27 avatar Oct 24 '19 16:10 kylejn27

I installed both libraries from source and the error seemed to go away. I did some digging and it seems that it's a problem with version 0.90 of xgboost.

I'm fairly certain that this is the culprit and that it was fixed a few days ago by a commit on xgboost master: https://github.com/dmlc/xgboost/commit/7e72a12871eaa0ebc46e863dabe8657e3f0557ad#diff-fd53d68e0037d3512896122d1248d969L1128

kylejn27 avatar Oct 24 '19 23:10 kylejn27

In that case, I would recommend asking upstream to make a new release.

jakirkham avatar Oct 25 '19 01:10 jakirkham

Ok, my conclusion was a bit premature. I ran the example that reproduced the error again today, after the issue above was closed, and realized that I had accidentally set n_workers=1 in my distributed client, so it wasn't running in distributed mode. I'm going to continue looking into this problem.

Here is how I reproduced the bug if anybody else was curious: https://github.com/kylejn27/dask-xgb-randomstate-bug

kylejn27 avatar Oct 25 '19 15:10 kylejn27

No solution yet, but I have some interesting information. I was tailing the dask worker logs and noticed a trend: if the threads the workers ran on were the same between executions of the train method, the feature importance graphs were the same.
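
One way to test that hypothesis (a sketch, reusing params/X/y from the snippets above) is to pin each worker to a single thread and compare the trained boosters directly instead of eyeballing the plots:

from distributed import Client
import dask_xgboost as dxgb

# With one thread per worker, task placement (and hence the order in
# which partial results are accumulated) should be more stable.
client = Client(n_workers=4, threads_per_worker=1)

bst1 = dxgb.train(client, params, X, y)
bst2 = dxgb.train(client, params, X, y)

# Identical tree dumps mean identical models.
print(bst1.get_dump() == bst2.get_dump())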

I'm not sure whether this is an issue with dask-xgboost or dmlc/xgboost, but I was able to reproduce it with the v1.0 version of dmlc/xgboost's native dask integration.

maybe this is expected behavior though? https://xgboost.readthedocs.io/en/latest/faq.html#slightly-different-result-between-runs
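
That FAQ entry boils down to floating-point addition not being associative: a distributed reduction that accumulates partial results in a different order can produce a slightly different total. A tiny illustration:

# Summing the same values in a different order gives a different result,
# because the intermediate roundings differ.
vals = [0.1, 0.2, 0.3, 1e16, -1e16]
print(sum(vals))            # 0.0
print(sum(reversed(vals)))  # 0.6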

kylejn27 avatar Oct 31 '19 16:10 kylejn27

cc @RAMitchell in case he has thoughts on what might be going on here.

mrocklin avatar Nov 04 '19 16:11 mrocklin

(or knows someone who can take a look)

mrocklin avatar Nov 04 '19 16:11 mrocklin

cc also @trivialfis

mrocklin avatar Nov 07 '19 16:11 mrocklin

Yup. We are still struggling with some blocking issues to make a new release.

trivialfis avatar Nov 07 '19 16:11 trivialfis

Ah ok. Good to know that this is on your radar. Thanks Jiaming!

mrocklin avatar Nov 07 '19 17:11 mrocklin