dask-xgboost
Ensure that training and testing data align
Currently, if a user provides training and testing data that have the same number of partitions but a different number of rows per partition, they get a non-informative error.
Given that we need to have all the data in memory anyway, we could just fix this for the user and balance partitions for them.
cc @jrbourbeau
Thanks for opening up this issue @mrocklin. Below is an example that reproduces the issue:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
from dask_xgboost import XGBClassifier

client = Client(processes=True,
                n_workers=2,
                threads_per_worker=1,
                memory_limit='3GB')

# Create dataset
np.random.seed(2)
a = np.random.rand(100, 10)
df = pd.DataFrame(a, columns=[f'feature_{i}' for i in range(a.shape[1])])

# Use different chunksizes so X and y end up with the same number of
# partitions but different numbers of rows per partition
X = dd.from_pandas(df.iloc[:, :-1], chunksize=50)
y = dd.from_pandas(df.iloc[:, -1], chunksize=51)

# Print out the length of each partition for X and y
print(X.map_partitions(len).compute())
print(y.map_partitions(len).compute())

# Fit dask-xgboost classifier
clf = XGBClassifier()
clf.fit(X, y)
Running this example will output
0 50
1 50
dtype: int64
for the X partition lengths, and
0 51
1 49
dtype: int64
for the y partition lengths (note that they differ), followed by an error message.
Traceback details
distributed.worker - WARNING - Compute Failed
Function: train_part
args: ({'DMLC_NUM_WORKER': 2, 'DMLC_TRACKER_URI': '127.0.0.1', 'DMLC_TRACKER_PORT': 9091}, {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': None, 'n_estimators': 100, 'nthread': 1, 'objective': 'multi:softprob', 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': 0, 'silent': 1, 'subsample': 1, 'num_class': 100}, [( feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8
0 0.435995 0.025926 0.549662 0.435322 0.420368 0.330335 0.204649 0.619271 0.299655
1 0.621134 0.529142 0.134580 0.513578 0.184440 0.785335 0.853975 0.494237 0.846561
2 0.505246 0.065287 0.428122 0.096531 0.127160 0.596745 0.226012 0.106946 0.220306
3 0.467787 0.201743 0.640407 0.483070 0.505237 0.386893 0.793637 0.580004 0.162299
4 0.964551 0.50000
kwargs: {'dmatrix_kwargs': {'feature_names': Index(['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4',
'feature_5', 'feature_6', 'feature_7', 'feature_8'],
dtype='object')}, 'num_boost_round': 100}
Exception: XGBoostError('Long error message', "b'[11:49:52] src/objective/multiclass_obj.cc:43: Check failed: preds->Size() == (static_cast<size_t>(param_.num_class) * info.labels_.size()) SoftmaxMultiClassObj: label size and pred size does not match\\n\\nStack trace returned 7 entries:\\n[bt] (0) 0 libxgboost.dylib 0x0000001c1c95aa51 dmlc::StackTrace() + 305\\n[bt] (1) 1 libxgboost.dylib 0x0000001c1c95a7df dmlc::LogMessageFatal::~LogMessageFatal() + 47\\n[bt] (2) 2 libxgboost.dylib 0x0000001c1c9d249e xgboost::obj::SoftmaxMultiClassObj::GetGradient(xgboost::HostDeviceVector<float>*, xgboost::MetaInfo const&, int, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*) + 430\\n[bt] (3) 3 libxgboost.dylib 0x0000001c1c956b66 xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) + 1014\\n[bt] (4) 4 libxgboost.dylib 0x0000001c1c9742f4 XGBoosterUpdateOneIter + 164\\n[bt] (5) 5 _ctypes.cpython-36m-darwin.so 0x0")
distributed.worker - WARNING - Compute Failed
Function: train_part
args: ({'DMLC_NUM_WORKER': 2, 'DMLC_TRACKER_URI': '127.0.0.1', 'DMLC_TRACKER_PORT': 9091}, {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': None, 'n_estimators': 100, 'nthread': 1, 'objective': 'multi:softprob', 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': 0, 'silent': 1, 'subsample': 1, 'num_class': 100}, [( feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8
50 0.266590 0.203876 0.296205 0.8
41868 0.924015 0.978584 0.414331 0.773187 0.168053
51 0.424206 0.844092 0.197020 0.818916 0.072325 0.846108 0.423649 0.140741 0.305917
52 0.446285 0.233264 0.238765 0.693173 0.647769 0.063925 0.812518 0.140184 0.240564
53 0.506255 0.850932 0.929737 0.463310 0.102749 0.474305 0.290458 0.309397 0.242959
54 0.397410 0.00357
kwargs: {'dmatrix_kwargs': {'feature_names': Index(['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4',
'feature_5', 'feature_6', 'feature_7', 'feature_8'],
dtype='object')}, 'num_boost_round': 100}
Exception: XGBoostError('Long error message', "b'[11:49:52] src/objective/multiclass_obj.cc:43: Check failed: preds->Size() == (static_cast<size_t>(param_.num_class) * info.labels_.size()) SoftmaxMultiClassObj: label size and pred size does not match\\n\\nStack trace returned 7 entries:\\n[bt] (0) 0 libxgboost.dylib 0x0000001c1c95aa51 dmlc::StackTrace() + 305\\n[bt] (1) 1 libxgboost.dylib 0x0000001c1c95a7df dmlc::LogMessageFatal::~LogMessageFatal() + 47\\n[bt] (2) 2 libxgboost.dylib 0x0000001c1c9d249e xgboost::obj::SoftmaxMultiClassObj::GetGradient(xgboost::HostDeviceVector<float>*, xgboost::MetaInfo const&, int, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*) + 430\\n[bt] (3) 3 libxgboost.dylib 0x0000001c1c956b66 xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) + 1014\\n[bt] (4) 4 libxgboost.dylib 0x0000001c1c9742f4 XGBoosterUpdateOneIter + 164\\n[bt] (5) 5 _ctypes.cpython-36m-darwin.so 0x0")
Traceback (most recent call last):
File "example.py", line 34, in <module>
clf.fit(X, y)
File "/Users/jbourbeau/quansight/dask-xgboost/dask_xgboost/core.py", line 353, in fit
num_boost_round=self.n_estimators)
File "/Users/jbourbeau/quansight/dask-xgboost/dask_xgboost/core.py", line 193, in train
labels, dmatrix_kwargs, **kwargs)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/distributed/client.py", line 670, in sync
return sync(self.loop, func, *args, **kwargs)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/Users/jbourbeau/quansight/dask-xgboost/dask_xgboost/core.py", line 155, in _train
results = yield client._gather(futures)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/distributed/client.py", line 1497, in _gather
traceback)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/Users/jbourbeau/quansight/dask-xgboost/dask_xgboost/core.py", line 90, in train_part
bst = xgb.train(param, dtrain, **kwargs)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/xgboost/training.py", line 204, in train
xgb_model=xgb_model, callbacks=callbacks)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/xgboost/training.py", line 74, in _train_internal
bst.update(dtrain, i, obj)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/xgboost/core.py", line 1021, in update
dtrain.handle))
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/xgboost/core.py", line 151, in _check_call
raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: ('Long error message', "b'[11:49:52] src/objective/multiclass_obj.cc:43: Check failed: preds->Size() == (static_cast<size_t>(param_.num_class) * info.labels_.size()) SoftmaxMultiClassObj: label size and pred size does not match\\n\\nStack trace returned 7 entries:\\n[bt] (0) 0 libxgboost.dylib 0x0000001c1c95aa51 dmlc::StackTrace() + 305\\n[bt] (1) 1 libxgboost.dylib 0x0000001c1c95a7df dmlc::LogMessageFatal::~LogMessageFatal() + 47\\n[bt] (2) 2 libxgboost.dylib 0x0000001c1c9d249e xgboost::obj::SoftmaxMultiClassObj::GetGradient(xgboost::HostDeviceVector<float>*, xgboost::MetaInfo const&, int, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*) + 430\\n[bt] (3) 3 libxgboost.dylib 0x0000001c1c956b66 xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) + 1014\\n[bt] (4) 4 libxgboost.dylib 0x0000001c1c9742f4 XGBoosterUpdateOneIter + 164\\n[bt] (5) 5 _ctypes.cpython-36m-darwin.so 0x0")
The relevant portion of the error message is "label size and pred size does not match": each worker builds its local DMatrix from a partition of X paired with the corresponding partition of y, so the row counts have to match partition by partition.
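In this reproducer both collections come from dd.from_pandas, so their divisions are known and the misalignment can be fixed by hand before calling fit. A sketch building on the example above:

# y currently splits 51/49; repartitioning onto X's divisions makes the
# partition boundaries, and hence the per-partition lengths, line up
y = y.repartition(divisions=X.divisions)
print(y.map_partitions(len).compute())  # now 50/50, matching X

clf = XGBClassifier()
clf.fit(X, y)  # should now train without the DMatrix size mismatch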
> Given that we need to have all the data in memory anyway, we could just fix this for the user and balance partitions for them.
I'm in favor of this idea. I'll open up a PR with a proposal to balance the partitions for input data.
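For what it's worth, here is a rough sketch of what that balancing could look like (the helper below is hypothetical, not the actual PR):

def _align_partitions(data, labels):
    # Hypothetical helper: make labels' partitioning match data's
    if data.known_divisions and labels.known_divisions:
        if data.divisions != labels.divisions:
            labels = labels.repartition(divisions=data.divisions)
        return data, labels
    # Without known divisions we can at least fail with a clear message
    # instead of XGBoost's opaque size-mismatch error
    data_lens = data.map_partitions(len).compute()
    label_lens = labels.map_partitions(len).compute()
    if list(data_lens) != list(label_lens):
        raise ValueError(
            "X and y must have the same number of rows in each partition; "
            "got %s and %s" % (list(data_lens), list(label_lens))
        )
    return data, labels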
Is there a workaround that will ensure training and testing data align?
I am reading several CSVs into Dask DataFrames:
train_data = dd.read_csv()
labels = dd.read_csv()
I confirmed that the lengths are the same: 502732. Then I run:
est = xgb.XGBClassifier()
est.fit(train_data, labels)
and get this error:
Check failed: preds.Size() == info.labels_.Size() (17398 vs. 213797) labels are not correctly provided
Currently there is no easy workaround. It would be good to provide one though. I would think that this work would happen upstream in dask.dataframe, but I don't have a concrete plan here. Help would be welcome.
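For the dd.read_csv case above, where divisions are unknown and repartitioning by divisions isn't available, one way to sidestep the problem is to read features and labels from the same files, so both collections share partition boundaries by construction. A sketch; the file pattern and the 'label' column name are hypothetical:

import dask.dataframe as dd
from dask_xgboost import XGBClassifier

df = dd.read_csv('train-*.csv')   # hypothetical file pattern
X = df.drop('label', axis=1)      # 'label' stands in for the real target column
y = df['label']

est = XGBClassifier()
est.fit(X, y)  # X and y come from the same dataframe, so partitions align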