dask-xgboost
Ensure that training and testing data align
Currently, if a user provides training and testing data that have the same number of partitions but a different number of rows per partition, they get a non-informative error.
Given that we need to have all the data in memory anyway, we could just fix this for the user and balance partitions for them.
cc @jrbourbeau
Thanks for opening up this issue @mrocklin. Below is an example that reproduces the issue:
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client
from dask_xgboost import XGBClassifier

client = Client(processes=True,
                n_workers=2,
                threads_per_worker=1,
                memory_limit='3GB')

# Create dataset
np.random.seed(2)
a = np.random.rand(100, 10)
df = pd.DataFrame(a, columns=[f'feature_{i}' for i in range(a.shape[1])])

# Use different chunksizes so X and y end up with the same number of
# partitions but different numbers of rows per partition
X = dd.from_pandas(df.iloc[:, :-1], chunksize=50)
y = dd.from_pandas(df.iloc[:, -1], chunksize=51)

# Print out the length of each partition for X and y
print(X.map_partitions(len).compute())
print(y.map_partitions(len).compute())

# Fit dask-xgboost classifier
clf = XGBClassifier()
clf.fit(X, y)
Running this example will output
0 50
1 50
dtype: int64
for the X partition lengths, and
0 51
1 49
dtype: int64
for the y partition lengths (note that they differ), followed by an error message.
Traceback details
distributed.worker - WARNING - Compute Failed
Function: train_part
args: ({'DMLC_NUM_WORKER': 2, 'DMLC_TRACKER_URI': '127.0.0.1', 'DMLC_TRACKER_PORT': 9091}, {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': None, 'n_estimators': 100, 'nthread': 1, 'objective': 'multi:softprob', 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': 0, 'silent': 1, 'subsample': 1, 'num_class': 100}, [( feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8
0 0.435995 0.025926 0.549662 0.435322 0.420368 0.330335 0.204649 0.619271 0.299655
1 0.621134 0.529142 0.134580 0.513578 0.184440 0.785335 0.853975 0.494237 0.846561
2 0.505246 0.065287 0.428122 0.096531 0.127160 0.596745 0.226012 0.106946 0.220306
3 0.467787 0.201743 0.640407 0.483070 0.505237 0.386893 0.793637 0.580004 0.162299
4 0.964551 0.50000
kwargs: {'dmatrix_kwargs': {'feature_names': Index(['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4',
'feature_5', 'feature_6', 'feature_7', 'feature_8'],
dtype='object')}, 'num_boost_round': 100}
Exception: XGBoostError('Long error message', "b'[11:49:52] src/objective/multiclass_obj.cc:43: Check failed: preds->Size() == (static_cast<size_t>(param_.num_class) * info.labels_.size()) SoftmaxMultiClassObj: label size and pred size does not match\\n\\nStack trace returned 7 entries:\\n[bt] (0) 0 libxgboost.dylib 0x0000001c1c95aa51 dmlc::StackTrace() + 305\\n[bt] (1) 1 libxgboost.dylib 0x0000001c1c95a7df dmlc::LogMessageFatal::~LogMessageFatal() + 47\\n[bt] (2) 2 libxgboost.dylib 0x0000001c1c9d249e xgboost::obj::SoftmaxMultiClassObj::GetGradient(xgboost::HostDeviceVector<float>*, xgboost::MetaInfo const&, int, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*) + 430\\n[bt] (3) 3 libxgboost.dylib 0x0000001c1c956b66 xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) + 1014\\n[bt] (4) 4 libxgboost.dylib 0x0000001c1c9742f4 XGBoosterUpdateOneIter + 164\\n[bt] (5) 5 _ctypes.cpython-36m-darwin.so 0x0")
distributed.worker - WARNING - Compute Failed
Function: train_part
args: ({'DMLC_NUM_WORKER': 2, 'DMLC_TRACKER_URI': '127.0.0.1', 'DMLC_TRACKER_PORT': 9091}, {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': None, 'n_estimators': 100, 'nthread': 1, 'objective': 'multi:softprob', 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': 0, 'silent': 1, 'subsample': 1, 'num_class': 100}, [( feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8
50 0.266590 0.203876 0.296205 0.8
41868 0.924015 0.978584 0.414331 0.773187 0.168053
51 0.424206 0.844092 0.197020 0.818916 0.072325 0.846108 0.423649 0.140741 0.305917
52 0.446285 0.233264 0.238765 0.693173 0.647769 0.063925 0.812518 0.140184 0.240564
53 0.506255 0.850932 0.929737 0.463310 0.102749 0.474305 0.290458 0.309397 0.242959
54 0.397410 0.00357
kwargs: {'dmatrix_kwargs': {'feature_names': Index(['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4',
'feature_5', 'feature_6', 'feature_7', 'feature_8'],
dtype='object')}, 'num_boost_round': 100}
Exception: XGBoostError('Long error message', "b'[11:49:52] src/objective/multiclass_obj.cc:43: Check failed: preds->Size() == (static_cast<size_t>(param_.num_class) * info.labels_.size()) SoftmaxMultiClassObj: label size and pred size does not match\\n\\nStack trace returned 7 entries:\\n[bt] (0) 0 libxgboost.dylib 0x0000001c1c95aa51 dmlc::StackTrace() + 305\\n[bt] (1) 1 libxgboost.dylib 0x0000001c1c95a7df dmlc::LogMessageFatal::~LogMessageFatal() + 47\\n[bt] (2) 2 libxgboost.dylib 0x0000001c1c9d249e xgboost::obj::SoftmaxMultiClassObj::GetGradient(xgboost::HostDeviceVector<float>*, xgboost::MetaInfo const&, int, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*) + 430\\n[bt] (3) 3 libxgboost.dylib 0x0000001c1c956b66 xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) + 1014\\n[bt] (4) 4 libxgboost.dylib 0x0000001c1c9742f4 XGBoosterUpdateOneIter + 164\\n[bt] (5) 5 _ctypes.cpython-36m-darwin.so 0x0")
Traceback (most recent call last):
File "example.py", line 34, in <module>
clf.fit(X, y)
File "/Users/jbourbeau/quansight/dask-xgboost/dask_xgboost/core.py", line 353, in fit
num_boost_round=self.n_estimators)
File "/Users/jbourbeau/quansight/dask-xgboost/dask_xgboost/core.py", line 193, in train
labels, dmatrix_kwargs, **kwargs)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/distributed/client.py", line 670, in sync
return sync(self.loop, func, *args, **kwargs)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
six.reraise(*error[0])
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
result[0] = yield future
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/Users/jbourbeau/quansight/dask-xgboost/dask_xgboost/core.py", line 155, in _train
results = yield client._gather(futures)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
value = future.result()
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
yielded = self.gen.throw(*exc_info)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/distributed/client.py", line 1497, in _gather
traceback)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/six.py", line 692, in reraise
raise value.with_traceback(tb)
File "/Users/jbourbeau/quansight/dask-xgboost/dask_xgboost/core.py", line 90, in train_part
bst = xgb.train(param, dtrain, **kwargs)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/xgboost/training.py", line 204, in train
xgb_model=xgb_model, callbacks=callbacks)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/xgboost/training.py", line 74, in _train_internal
bst.update(dtrain, i, obj)
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/xgboost/core.py", line 1021, in update
dtrain.handle))
File "/Users/jbourbeau/miniconda/envs/quansight/lib/python3.6/site-packages/xgboost/core.py", line 151, in _check_call
raise XGBoostError(_LIB.XGBGetLastError())
xgboost.core.XGBoostError: ('Long error message', "b'[11:49:52] src/objective/multiclass_obj.cc:43: Check failed: preds->Size() == (static_cast<size_t>(param_.num_class) * info.labels_.size()) SoftmaxMultiClassObj: label size and pred size does not match\\n\\nStack trace returned 7 entries:\\n[bt] (0) 0 libxgboost.dylib 0x0000001c1c95aa51 dmlc::StackTrace() + 305\\n[bt] (1) 1 libxgboost.dylib 0x0000001c1c95a7df dmlc::LogMessageFatal::~LogMessageFatal() + 47\\n[bt] (2) 2 libxgboost.dylib 0x0000001c1c9d249e xgboost::obj::SoftmaxMultiClassObj::GetGradient(xgboost::HostDeviceVector<float>*, xgboost::MetaInfo const&, int, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal<float> >*) + 430\\n[bt] (3) 3 libxgboost.dylib 0x0000001c1c956b66 xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*) + 1014\\n[bt] (4) 4 libxgboost.dylib 0x0000001c1c9742f4 XGBoosterUpdateOneIter + 164\\n[bt] (5) 5 _ctypes.cpython-36m-darwin.so 0x0")
The relevant portion of the error message is "label size and pred size does not match": each worker builds its local DMatrix from a partition of X paired with the corresponding partition of y, so the row counts have to match partition by partition.
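In this reproducer both collections come from dd.from_pandas, so their divisions are known and the misalignment can be fixed by hand before calling fit. A sketch building on the example above:

# y currently splits 51/49; repartitioning onto X's divisions makes the
# partition boundaries, and hence the per-partition lengths, line up
y = y.repartition(divisions=X.divisions)
print(y.map_partitions(len).compute())  # now 50/50, matching X

clf = XGBClassifier()
clf.fit(X, y)  # should now train without the DMatrix size mismatch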
> Given that we need to have all the data in memory anyway, we could just fix this for the user and balance partitions for them.
I'm in favor of this idea. I'll open up a PR with a proposal to balance the partitions for input data.
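For what it's worth, here is a rough sketch of what that balancing could look like (the helper below is hypothetical, not the actual PR):

def _align_partitions(data, labels):
    # Hypothetical helper: make labels' partitioning match data's
    if data.known_divisions and labels.known_divisions:
        if data.divisions != labels.divisions:
            labels = labels.repartition(divisions=data.divisions)
        return data, labels
    # Without known divisions we can at least fail with a clear message
    # instead of XGBoost's opaque size-mismatch error
    data_lens = data.map_partitions(len).compute()
    label_lens = labels.map_partitions(len).compute()
    if list(data_lens) != list(label_lens):
        raise ValueError(
            "X and y must have the same number of rows in each partition; "
            "got %s and %s" % (list(data_lens), list(label_lens))
        )
    return data, labels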
Is there a workaround that will ensure training and testing data align?
I am reading several CSVs into Dask DataFrames:
train_data = dd.read_csv()
labels = dd.read_csv()
I confirmed that the lengths are the same: 502732. Then I run:
est = xgb.XGBClassifier()
est.fit(train_data, labels)
and get this error:
Check failed: preds.Size() == info.labels_.Size() (17398 vs. 213797) labels are not correctly provided
Currently there is no easy workaround. It would be good to provide one though. I would think that this work would happen upstream in dask.dataframe, but I don't have a concrete plan here. Help would be welcome.
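For the dd.read_csv case above, where divisions are unknown and repartitioning by divisions isn't available, one way to sidestep the problem is to read features and labels from the same files, so both collections share partition boundaries by construction. A sketch; the file pattern and the 'label' column name are hypothetical:

import dask.dataframe as dd
from dask_xgboost import XGBClassifier

df = dd.read_csv('train-*.csv')   # hypothetical file pattern
X = df.drop('label', axis=1)      # 'label' stands in for the real target column
y = df['label']

est = XGBClassifier()
est.fit(X, y)  # X and y come from the same dataframe, so partitions align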