dask-xgboost
AttributeError when using GridSearchCV with XGBClassifier
Hello,
I'm working on a small proof of concept. I use dask in my project and would like to use the XGBClassifier. I also need a parameter search and, of course, cross-validation mechanisms.
Unfortunately, when fitting the dask_xgboost.XGBClassifier, I get the following error:
Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_xgboost\core.py", line 97, in _train
AttributeError: 'DataFrame' object has no attribute 'to_delayed'
Although I call .fit() with two dask objects, the data somehow becomes a pandas.DataFrame later on.
Here's the code I'm using:
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.model_selection import GridSearchCV
from dask_xgboost import XGBClassifier
from distributed import Client
from sklearn.datasets import load_iris

if __name__ == '__main__':
    client = Client()

    data = load_iris()
    x = pd.DataFrame(data=data['data'], columns=data['feature_names'])
    x = dd.from_pandas(x, npartitions=2)
    y = pd.Series(data['target'])
    y = dd.from_pandas(y, npartitions=2)

    estimator = XGBClassifier(objective='multi:softmax', num_class=4)
    grid_search = GridSearchCV(
        estimator,
        param_grid={
            'n_estimators': np.arange(15, 105, 15)
        },
        scheduler='threads'
    )
    grid_search.fit(x, y)

    results = pd.DataFrame(grid_search.cv_results_)
    print(results.to_string())
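If it helps narrow things down, my guess (untested) is that the same code path fails when fit() is given plain pandas objects, with no GridSearchCV involved at all; a minimal sketch of that check:

import pandas as pd
from dask_xgboost import XGBClassifier
from distributed import Client
from sklearn.datasets import load_iris

if __name__ == '__main__':
    client = Client()
    data = load_iris()
    x = pd.DataFrame(data=data['data'], columns=data['feature_names'])
    y = pd.Series(data['target'])
    # Presumably raises the same
    # AttributeError: 'DataFrame' object has no attribute 'to_delayed'
    XGBClassifier().fit(x, y)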
I use the packages in the following versions:
pandas==0.23.3
numpy==1.15.1
dask==0.20.0
dask-ml==0.11.0
dask-xgboost==0.1.5
Note that I don't get this exception when using sklearn.ensemble.GradientBoostingClassifier.
Any help would be appreciated.
Mateusz
Can you try with master? Older versions didn't properly handle pandas / numpy objects passed to train, but I think that's fixed now.
Will try to get a release out soon.
Does our GridSearchCV even handle dask-ml estimators? I thought that it was mostly optimized for parameter searches on scikit-learn estimators.
I assume by "dask-ml estimators" you mean dask data objects? dask_ml.model_selection.GridSearchCV should work fine on either, but it does require that the underlying estimator being searched over supports whatever is passed to it (and doesn't blow up memory).
When dask_xgboost encounters a pandas or NumPy object, it just trains the Booster locally. I wonder if that should be done on a worker, in case you have resources like a GPU you want used.
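To illustrate, a minimal sketch of that dispatch (not dask-xgboost's actual code; distributed_train stands in for the real distributed code path):

import dask
import xgboost as xgb

def fit_dispatch(params, X, y, distributed_train):
    # dask collections are lazy; everything else is a concrete object
    if dask.is_dask_collection(X):
        # hand the lazy partitions to the distributed trainer (hypothetical helper)
        return distributed_train(params, X, y)
    # concrete pandas / NumPy input: just train a plain Booster locally
    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(params, dtrain)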
Thanks for the response.
Okay, I've tried with master, but now another problem appears:
Traceback (most recent call last):
  File "C:/(...)/aijin-prescoring/aijin/prescoring/sandbox/prediction/xgboost_poc/dask_xgb_sample_fail.py", line 30, in <module>
    grid_search.fit(x, y)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\_search.py", line 1200, in fit
    out = scheduler(dsk, keys, num_workers=n_jobs)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 501, in get_async
    raise_exception(exc, tb)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\compatibility.py", line 112, in reraise
    raise exc
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 272, in execute_task
    result = _execute_task(task, data)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 253, in _execute_task
    return func(*args2)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\methods.py", line 322, in fit_and_score
    est_and_time = fit(est, X_train, y_train, error_score, fields, params, fit_params)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\methods.py", line 242, in fit
    est.fit(X, y, **fit_params)
  File "C:\(...)\dask-xgboost-master\dask_xgboost\core.py", line 326, in fit
    classes = classes.compute()
AttributeError: 'numpy.ndarray' object has no attribute 'compute'
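The failing line seems to call .compute() unconditionally; a minimal sketch of a guard that would let concrete NumPy arrays pass through (names assumed, just to illustrate the idea):

import dask

def maybe_compute(classes):
    # Only dask collections have .compute(); plain numpy arrays pass through.
    if dask.is_dask_collection(classes):
        return classes.compute()
    return classes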
Whoops, I've accidentally been running your script on my branch for https://github.com/dask/dask-xgboost/pull/28, which fixes this exact issue :) I didn't realize that wasn't merged.
I'm going to kick off the CI again, and then merge it.
Hah, glad to read this!
Thank you.
Hi!
Are there any updates on this issue?
I'm hitting the same problem, and the PR unfortunately did not get merged, as the CI pipeline failed.
I don't know if it works for you, but you might be interested in plain xgboost's external-memory API.
I've ended up searching hyperparameters with hyperopt and training on large data with the external-memory API, reading the data from multiple CSV files without dask (currently I use dask only for the preprocessing part).
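For reference, a minimal sketch of that external-memory usage (the file name is a placeholder; the '#dtrain.cache' suffix enables the on-disk cache):

import xgboost as xgb

# 'train.libsvm' is a placeholder path; the '#dtrain.cache' suffix tells
# xgboost to stream the data through an on-disk cache instead of holding
# the whole DMatrix in memory.
dtrain = xgb.DMatrix('train.libsvm#dtrain.cache')
params = {'objective': 'multi:softmax', 'num_class': 3}
booster = xgb.train(params, dtrain, num_boost_round=50)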
I was able to install the branch from #28 and it works for my use case. @TomAugspurger I would be interested in helping solve the CI problems, but I don't know where to begin (the error is in multiprocessing when using distributed.utils_test.cluster), so if you would welcome help and be willing to point me in the right direction, just ping me. No worries if that is more trouble than it is worth.
I spent another couple of hours on this with no luck... It's just hard to work around xgboost's behavior of basically doing sys.exit(0) when you try to init their workers twice within a thread. In theory, keeping the initialization state in a thread-local should suffice, but I haven't been able to make that work yet, sorry. I don't think I'll have any more time to work on it this week.
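For anyone picking this up, a minimal sketch of the thread-local idea (illustrative names; as said above, I haven't gotten the real fix working):

import threading

_local = threading.local()

def init_worker_once(init_fn):
    # Track per-thread whether the xgboost worker was already initialized,
    # so a repeated init in the same thread becomes a no-op instead of
    # hitting xgboost's exit-on-reinit behavior.
    if getattr(_local, 'initialized', False):
        return
    init_fn()
    _local.initialized = True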
FYI, the sparse tests seem to have started failing with the latest xgboost. They're no longer being duck-typed as sparse arrays.
Is there any update on this issue? I am also encountering the same problem.
Still open. You can apply https://github.com/dask/dask-xgboost/pull/28. IIRC there are some issues with the CI / testing on master, but no one has had time to resolve them (LMK if you're interested in working on it).