AttributeError when using GridSearchCV with XGBClassifier

Open mateuszkaleta opened this issue 6 years ago • 12 comments

Hello,

I'm working on a small proof of concept. I use dask in my project and would like to use the XGBClassifier. I also need a parameter search and, of course, cross-validation mechanisms.

Unfortunately, when fitting the dask_xgboost.XGBClassifier, I get the following error:

Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_xgboost\core.py", line 97, in _train AttributeError: 'DataFrame' object has no attribute 'to_delayed'

Although I call .fit() with two dask objects, the data somehow becomes a pandas.DataFrame later on.

Here's the code I'm using:

import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.model_selection import GridSearchCV
from dask_xgboost import XGBClassifier
from distributed import Client
from sklearn.datasets import load_iris

if __name__ == '__main__':

    client = Client()

    data = load_iris()

    x = pd.DataFrame(data=data['data'], columns=data['feature_names'])
    x = dd.from_pandas(x, npartitions=2)

    y = pd.Series(data['target'])
    y = dd.from_pandas(y, npartitions=2)

    estimator = XGBClassifier(objective='multi:softmax', num_class=4)
    grid_search = GridSearchCV(
        estimator,
        param_grid={
            'n_estimators': np.arange(15, 105, 15)
        },
        scheduler='threads'
    )

    grid_search.fit(x, y)
    results = pd.DataFrame(grid_search.cv_results_)
    print(results.to_string())

I use the packages in the following versions:

pandas==0.23.3
numpy==1.15.1
dask==0.20.0
dask-ml==0.11.0
dask-xgboost==0.1.5

Note that I don't get this exception when using sklearn.ensemble.GradientBoostingClassifier.

Any help would be appreciated.

Mateusz

mateuszkaleta avatar Nov 06 '18 09:11 mateuszkaleta

Can you try with master? Older versions didn't properly handle pandas / numpy objects passed to train, but I think that's fixed now.

Will try to get a release out soon.
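
For reference, one way to try master is to install straight from the GitHub repository (a sketch assuming pip and git are available in your environment; adjust for conda as needed):

pip install git+https://github.com/dask/dask-xgboost.git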

TomAugspurger avatar Nov 06 '18 13:11 TomAugspurger

Does our GridSearchCV even handle dask-ml estimators? I thought that it was mostly optimized for parameter searches on scikit-learn estimators.

mrocklin avatar Nov 06 '18 13:11 mrocklin

I assume by "dask-ml estimators" you mean dask data objects? dask_ml.model_selection.GridSearchCV should work fine on either, with the requirement that the underlying estimator being searched over supports whatever is passed to it (and doesn't blow up memory).

When dask_xgboost encounters a pandas or NumPy object, it just trains the Booster locally. I wonder if that should be done on a worker instead, in case you have resources, like a GPU, that you want used.
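
A minimal sketch of that kind of dispatch (a hypothetical helper, not the actual dask_xgboost code; it only illustrates the distributed-vs-local split described above):

import dask.dataframe as dd
import xgboost as xgb


def train_any(client, params, X, y):
    # Hypothetical dispatcher: distributed training for dask collections,
    # plain local training for in-memory pandas / NumPy inputs.
    if isinstance(X, (dd.DataFrame, dd.Series)):
        import dask_xgboost
        return dask_xgboost.train(client, params, X, y)
    # In-memory fallback: build a DMatrix and train a Booster locally.
    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(params, dtrain)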

TomAugspurger avatar Nov 06 '18 13:11 TomAugspurger

Thanks for the response.

Can you try with master? Older versions didn't properly handle pandas / numpy objects passed to train, but I think that's fixed now.

Okay, I've tried with master, but now another problem appears:

Traceback (most recent call last):
  File "C:/(...)/aijin-prescoring/aijin/prescoring/sandbox/prediction/xgboost_poc/dask_xgb_sample_fail.py", line 30, in <module>
    grid_search.fit(x, y)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\_search.py", line 1200, in fit
    out = scheduler(dsk, keys, num_workers=n_jobs)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 501, in get_async
    raise_exception(exc, tb)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\compatibility.py", line 112, in reraise
    raise exc
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 272, in execute_task
    result = _execute_task(task, data)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask\local.py", line 253, in _execute_task
    return func(*args2)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\methods.py", line 322, in fit_and_score
    est_and_time = fit(est, X_train, y_train, error_score, fields, params, fit_params)
  File "C:\(...)\Anaconda_3.5.1\envs\prescoring\lib\site-packages\dask_ml\model_selection\methods.py", line 242, in fit
    est.fit(X, y, **fit_params)
  File "C:\(...)\dask-xgboost-master\dask_xgboost\core.py", line 326, in fit
    classes = classes.compute()
AttributeError: 'numpy.ndarray' object has no attribute 'compute'
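
For context, the failing line assumes y is a dask object. A defensive version of that step (a sketch of the general idea behind the linked PR, not the actual patch) would only call .compute() when it exists:

import numpy as np


def unique_classes(y):
    # Only dask collections expose .compute(); call it conditionally so
    # plain NumPy arrays and pandas Series pass through unchanged.
    if hasattr(y, "compute"):
        y = y.compute()
    return np.unique(y)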

mateuszkaleta avatar Nov 07 '18 06:11 mateuszkaleta

Whoops, I've accidentally been running your script on my branch for https://github.com/dask/dask-xgboost/pull/28, which fixes this exact issue :) I didn't realize that wasn't merged.

I'm going to kick off the CI again, and then merge it.

TomAugspurger avatar Nov 07 '18 13:11 TomAugspurger

Hah, glad to read this!

Thank you.

mateuszkaleta avatar Nov 08 '18 06:11 mateuszkaleta

Hi!

Are there any updates on this issue?

I'm hitting the same problem, and the PR unfortunately did not get merged because the CI pipeline failed.

ajdani avatar Jan 23 '19 09:01 ajdani

I don't know if it works for you, but you might be interested in the external memory API of the original xgboost library.

I've ended up searching hyperparameters with hyperopt and training on large data using the external memory API, reading the data from multiple CSV files without dask (currently I use dask only for the preprocessing part).
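
For anyone landing here, a rough sketch of that external-memory usage (file names are placeholders; the "#cache" suffix on a LibSVM path is what enables xgboost's external memory mode, per its documentation at the time):

import xgboost as xgb

# Appending "#<cache prefix>" to a LibSVM file path tells DMatrix to stream
# the data from disk instead of loading it all into memory.
dtrain = xgb.DMatrix("train.libsvm#dtrain.cache")
dvalid = xgb.DMatrix("valid.libsvm#dvalid.cache")

params = {"objective": "multi:softmax", "num_class": 3, "max_depth": 6}
booster = xgb.train(params, dtrain, num_boost_round=100,
                    evals=[(dvalid, "valid")])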

mateuszkaleta avatar Jan 23 '19 11:01 mateuszkaleta

I was able to install the branch from #28 and it works for my use case. @TomAugspurger I would be interested in helping solve the CI problems, but I don't know where to begin (the error is in multiprocessing when using distributed.utils_test.cluster), so if you would welcome help and be willing to point me in the right direction, just ping me. No worries if that is more trouble than it is worth.

quartox avatar Mar 01 '19 21:03 quartox

I spent another couple of hours on this with no luck... It's just hard to work around xgboost's behavior of basically doing sys.exit(0) when you try to init its workers twice within a thread. In theory, keeping the initialization state as a thread-local should suffice, but I haven't been able to make that work yet, sorry. I don't think I'll have any more time to work on it this week.
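
The thread-local idea, roughly (a sketch only; start_xgboost_worker is a hypothetical stand-in for the real worker start-up, not code from this repo):

import threading

_local = threading.local()


def ensure_worker_started(start_xgboost_worker, *args, **kwargs):
    # Hypothetical guard: remember per-thread whether the worker was already
    # initialized, so a second call in the same thread becomes a no-op
    # instead of re-initializing (which xgboost aborts on).
    if getattr(_local, "started", False):
        return
    start_xgboost_worker(*args, **kwargs)
    _local.started = True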

FYI, the sparse tests seem to have started failing with the latest xgboost. They're no longer being duck-typed as sparse arrays.

TomAugspurger avatar Mar 04 '19 12:03 TomAugspurger

Is there any update on this issue? I am also encountering the same problem.

pasayatpravat avatar May 11 '19 09:05 pasayatpravat

Still open. You can apply https://github.com/dask/dask-xgboost/pull/28. IIRC there are some issues with the CI / testing on master, but no one has had time to resolve them (LMK if you're interested in working on it).
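
One way to try that PR branch locally (a sketch using GitHub's pull/<n>/head refs; adjust paths and environments as needed):

git clone https://github.com/dask/dask-xgboost.git
cd dask-xgboost
git fetch origin pull/28/head:pr-28
git checkout pr-28
pip install -e .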

TomAugspurger avatar May 13 '19 15:05 TomAugspurger