
DaskXGBClassifier and dask-ml GridSearchCV throw TypeErrors with Dask Arrays.

Open mseflek opened this issue 3 years ago • 10 comments

I believe this is related to #521. When using dask_ml.model_selection.GridSearchCV with xgboost.dask.DaskXGBClassifier, fit raises a TypeError saying the estimator expects dask.array.core.Array, even though Dask Arrays are being passed in.

MCVE:

import dask.array as da
import numpy as np
import pandas as pd
from dask_ml.model_selection import GridSearchCV
from xgboost.dask import DaskXGBClassifier
from dask.distributed import Client
from sklearn.datasets import load_iris

if __name__ == '__main__':

    client = Client()

    data = load_iris()
    x = da.from_array(data.data)
    y = da.from_array(data.target)

    estimator = DaskXGBClassifier(objective='multi:softmax', num_class=4)
    grid_search = GridSearchCV(
        estimator,
        param_grid={
            'n_estimators': np.arange(15, 105, 15)
        },
    )

    grid_search.fit(x, y)
    results = pd.DataFrame(grid_search.cv_results_)
    print(results.to_string())

Error:

distributed.worker - WARNING - Compute Failed
Function:  fit_and_score
args:      (DaskXGBClassifier(num_class=4, objective='multi:softmax'), <dask_ml.model_selection.methods.CVCache object at 0x7f0be9fa4f40>, array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
...
kwargs:    {}
Exception: TypeError("Expecting <class 'dask.dataframe.core.DataFrame'> or <class 'dask.array.core.Array'>.  Got <class 'numpy.ndarray'>")

Environment:

  • Dask version: 2021.04.0
  • Dask-ml version: 1.9.0
  • Python version: 3.8.5
  • Operating System: Ubuntu 20.04
  • Install method (conda, pip, source): pip

mseflek avatar May 16 '21 05:05 mseflek

I think the expectation is that param_grid is given a Dask Array as well.

jakirkham avatar May 18 '21 03:05 jakirkham

Unfortunately that doesn't seem to be the case. I changed it to:

param_grid={
    'n_estimators': da.from_array(np.arange(15, 105, 15))
},

and got this:

ValueError: Parameter grid for parameter (n_estimators) needs to be a list or numpy array, but got (<class 'dask.array.core.Array'>). Single values need to be wrapped in a list with one element.

mseflek avatar May 18 '21 04:05 mseflek

cc @hcho3 @trivialfis (in case either of you have thoughts 🙂)

jakirkham avatar May 18 '21 04:05 jakirkham

  • TL;DR: Use xgb.XGBClassifier with dask_ml.model_selection.GridSearchCV.

  • Longer story: GridSearchCV is designed to extend single-node algorithms, such as scikit-learn estimators that accept NumPy arrays. However, xgb.dask.DaskXGBClassifier is itself distributed and accepts only dask collections. GridSearchCV splits the data and runs the estimator with different hyper-parameters on local data partitions, instead of running it on the distributed data structure. To support a distributed GridSearchCV, a new version of this function would need to pass the dask collection directly into the estimators. (A minimal sketch of the mismatch follows below.)
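
As a concrete reduction of that mismatch (a hypothetical sketch, not dask-ml internals verbatim), this is roughly the call that fit_and_score ends up making:

import numpy as np
from dask.distributed import Client
from xgboost.dask import DaskXGBClassifier

if __name__ == '__main__':
    client = Client()

    # dask-ml's fit_and_score hands each CV split to the estimator as a
    # materialized NumPy array, so the failing call reduces to roughly this:
    X = np.array([[5.1, 3.5, 1.4, 0.2],
                  [4.9, 3.0, 1.4, 0.2]])
    y = np.array([0, 1])

    estimator = DaskXGBClassifier()
    try:
        estimator.fit(X, y)  # rejected: not a dask collection
    except TypeError as err:
        print(err)  # Expecting dask.dataframe.core.DataFrame or
                    # dask.array.core.Array.  Got <class 'numpy.ndarray'>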

trivialfis avatar May 18 '21 06:05 trivialfis

I see. Thanks for the insight @trivialfis and @jakirkham. Not sure if this is already in the documentation, but it would be great if that could be clarified.

I'll run some benchmarks on my end as well, but I want to ask if there's a best practice here. What would you suggest for a single-node, multi-GPU setup where you want to use GridSearchCV and XGBoost for hyperparameter tuning? Just to get it working I'm using xgb.dask.DaskXGBClassifier with scikit-learn's GridSearchCV, but I'm wondering if a different strategy would be more efficient.

Also should I convert this to a feature request so we can keep track of your suggested revision?

Thanks!

mseflek avatar May 18 '21 18:05 mseflek

Yeah we were working on getting some documentation moved over as well. Maybe it is worth looking at PR (https://github.com/dmlc/xgboost/pull/6970) to see if this answers your question or if there are things we should be capturing there?

@trivialfis should be able to advise better on usage details. So will let him follow up on those

jakirkham avatar May 18 '21 18:05 jakirkham

Answering my own benchmarking question: on 4 x Nvidia V100 GPUs:

xgboost.XGBClassifier with dask_ml.model_selection.GridSearchCV

import dask.array as da
import numpy as np
from sklearn.datasets import load_iris

from dask_cuda import LocalCUDACluster
from dask_ml.model_selection import GridSearchCV
from xgboost import XGBClassifier
from dask.distributed import Client
import time

cluster = LocalCUDACluster()
client = Client(cluster)

data = load_iris()
x = da.from_array(data.data)
y = da.from_array(data.target)

start = time.time()
estimator = XGBClassifier(objective='multi:softmax', num_class=4, tree_method='gpu_hist')

grid_search = GridSearchCV(
    estimator,
    param_grid={
        'n_estimators': np.arange(15, 105, 15)
    },
)

grid_search.fit(x, y)
print(time.time() - start)

Output: 32 seconds.

xgboost.dask.DaskXGBClassifier with sklearn.model_selection.GridSearchCV

import dask.array as da
import numpy as np
from sklearn.datasets import load_iris

from sklearn.model_selection import GridSearchCV as skGridCV
from xgboost.dask import DaskXGBClassifier
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import time

cluster = LocalCUDACluster()
client = Client(cluster)

data = load_iris()
x = da.from_array(data.data)
y = da.from_array(data.target)

start = time.time()
estimator = DaskXGBClassifier(objective='multi:softmax', num_class=4, tree_method='gpu_hist')
estimator.client = client

grid_search = skGridCV(
    estimator,
    param_grid={
        'n_estimators': np.arange(15, 105, 15)
    },
)

grid_search.fit(x, y)
print(time.time() - start)

Output: 11 seconds.

mseflek avatar May 18 '21 18:05 mseflek

I think using the sklearn grid search will pull the data to workers arbitrarily.

Also should I convert this to a feature request so we can keep track of your suggested revision?

Yup, that would be great.

trivialfis avatar May 19 '21 04:05 trivialfis

I think using the sklearn grid search will pull the data to workers arbitrarily.

Meaning that the results from the GridSearchCV will be invalid?

mseflek avatar May 19 '21 04:05 mseflek

The results should be correct, just quite inefficient.
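
One possible way to reduce that data movement, as a sketch assuming the LocalCUDACluster benchmark setup above (untested here, so treat it as an assumption rather than a recommendation), is to persist the arrays on the cluster before calling grid_search.fit:

import dask
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from sklearn.datasets import load_iris

cluster = LocalCUDACluster()
client = Client(cluster)

data = load_iris()
x = da.from_array(data.data)
y = da.from_array(data.target)

# Persist the collections on the cluster up front so that repeated CV splits
# reuse cached chunks instead of recomputing/moving them for every
# hyper-parameter setting.
x, y = dask.persist(x, y)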

trivialfis avatar May 19 '21 06:05 trivialfis