DaskXGBClassifier and dask-ml GridSearchCV throw TypeErrors with Dask Arrays.
I believe this is related to #521. When using dask-ml with xgboost.dask.DaskXGBClassifier and dask_ml.model_selection.GridSearchCV, you run into TypeErrors suggesting that the estimator expects dask.array.core.Array types, even while passing Dask Arrays.
MCVE:
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd
from dask_ml.model_selection import GridSearchCV
from xgboost.dask import DaskXGBClassifier
from dask.distributed import Client
from sklearn.datasets import load_iris
if __name__ == '__main__':
    client = Client()

    data = load_iris()
    x = da.from_array(data.data)
    y = da.from_array(data.target)

    estimator = DaskXGBClassifier(objective='multi:softmax', num_class=4)
    grid_search = GridSearchCV(
        estimator,
        param_grid={
            'n_estimators': np.arange(15, 105, 15)
        },
    )
    grid_search.fit(x, y)

    results = pd.DataFrame(grid_search.cv_results_)
    print(results.to_string())
Error:
distributed.worker - WARNING - Compute Failed
Function: fit_and_score
args: (DaskXGBClassifier(num_class=4, objective='multi:softmax'), <dask_ml.model_selection.methods.CVCache object at 0x7f0be9fa4f40>, array([[5.1, 3.5, 1.4
, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
...
kwargs: {}
Exception: TypeError("Expecting <class 'dask.dataframe.core.DataFrame'> or <class 'dask.array.core.Array'>. Got <class 'numpy.ndarray'>")
Environment:
- Dask version: 2021.04.0
- Dask-ml version: 1.9.0
- Python version: 3.8.5
- Operating System: Ubuntu 20.04
- Install method (conda, pip, source): pip
I think the expectation is that param_grid is given a Dask Array as well. Unfortunately, that doesn't seem to be the case. I changed it to:
param_grid={
    'n_estimators': da.from_array(np.arange(15, 105, 15))
},
and got this:
ValueError: Parameter grid for parameter (n_estimators) needs to be a list or numpy array, but got (<class 'dask.array.core.Array'>). Single values need to be wrapped in a list with one element.
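For reference (and as the ValueError itself states), param_grid values are expected to be plain Python lists or NumPy arrays, so the accepted form looks like this:

param_grid={
    'n_estimators': list(range(15, 105, 15))  # a plain list; np.arange(15, 105, 15) also works
},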
cc @hcho3 @trivialfis (in case either of you have thoughts 🙂)
- TL;DR: Use xgb.XGBClassifier with dask_ml.model_selection.GridSearchCV.
- Longer story: GridSearchCV is designed to extend single-node algorithms, like scikit-learn estimators that accept NumPy arrays. However, xgb.dask.DaskXGBClassifier is itself distributed and accepts only Dask collections. GridSearchCV splits the data and runs the estimators with different hyper-parameters on local data partitions, instead of running them on the distributed data structure. To support a distributed GridSearchCV, we would need to implement a new version of it that can pass the Dask collection directly into the estimators. (A conceptual sketch of where the TypeError comes from follows this list.)
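To make that concrete, here is a rough conceptual sketch (an illustration only, not dask-ml's actual code path) of what the fit_and_score / CVCache traceback above suggests: each CV split is materialized as an in-memory NumPy array before being handed to the estimator, and DaskXGBClassifier rejects that input.

# Conceptual sketch only -- not dask-ml's real implementation. It illustrates
# why DaskXGBClassifier raises the TypeError shown above: the CV machinery
# hands it concrete NumPy blocks rather than Dask collections.
import dask.array as da

x = da.random.random((150, 4), chunks=(50, 4))
y = da.random.randint(0, 3, size=150, chunks=50)

# The grid search effectively computes each train/test split...
x_train = x[:100].compute()   # -> numpy.ndarray, no longer a dask.array.core.Array
y_train = y[:100].compute()

# ...and then calls estimator.fit(x_train, y_train). A scikit-learn style
# estimator such as xgb.XGBClassifier accepts this; xgb.dask.DaskXGBClassifier
# instead requires dask.array.Array / dask.dataframe.DataFrame inputs and
# raises the "Expecting <class 'dask.array.core.Array'> ..." TypeError.
print(type(x_train))  # <class 'numpy.ndarray'>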
I see. Thanks for the insight @trivialfis and @jakirkham. I'm not sure if this is already in the documentation, but it would be great if it could be clarified.
I'll run some benchmarks on my end as well, but I want to ask if there's a best practice here. What would you suggest for a single-node, multi-GPU situation where you want to use GridSearchCV and XGBoost for hyperparameter tuning? Just to get things working I'm using xgb.dask.DaskXGBClassifier with scikit-learn's GridSearchCV, but I'm wondering if a different strategy would be more efficient.
Also should I convert this to a feature request so we can keep track of your suggested revision?
Thanks!
Yeah, we were working on getting some documentation moved over as well. Maybe it is worth looking at the PR (https://github.com/dmlc/xgboost/pull/6970) to see if it answers your question, or if there are things we should be capturing there?
@trivialfis should be able to advise better on usage details, so I'll let him follow up on those.
Answering my own benchmarking question: on 4 x Nvidia V100 GPUs:
xgboost.XGBClassifier with dask_ml.model_selection.GridSearchCV:
import dask.array as da
import dask.dataframe as dd
import numpy as np
from sklearn.datasets import load_iris
from dask_cuda import LocalCUDACluster
from dask_ml.model_selection import GridSearchCV
from xgboost import XGBClassifier
from dask.distributed import Client
import time
cluster = LocalCUDACluster()
client = Client(cluster)
data = load_iris()
x = da.from_array(data.data)
y = da.from_array(data.target)
start = time.time()
estimator = XGBClassifier(objective='multi:softmax', num_class=4, tree_method='gpu_hist')
grid_search = GridSearchCV(
    estimator,
    param_grid={
        'n_estimators': np.arange(15, 105, 15)
    },
)
grid_search.fit(x, y)
print(time.time() - start)
Output: 32 seconds
xgboost.dask.DaskXGBClassifier with sklearn.model_selection.GridSearchCV:
import dask.array as da
import dask.dataframe as dd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV as skGridCV
from xgboost.dask import DaskXGBClassifier
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import time
cluster = LocalCUDACluster()
client = Client(cluster)
data = load_iris()
x = da.from_array(data.data)
y = da.from_array(data.target)
start = time.time()
estimator = DaskXGBClassifier(objective='multi:softmax', num_class=4, tree_method='gpu_hist')
estimator.client = client
grid_search = skGridCV(
    estimator,
    param_grid={
        'n_estimators': np.arange(15, 105, 15)
    },
)
grid_search.fit(x, y)
print(time.time() - start)
Output: 11 seconds.
I think using the sklearn grid search will pull the data to workers arbitrarily.
Also should I convert this to a feature request so we can keep track of your suggested revision?
Yup, that would be great.
I think using the sklearn grid search will pull the data to workers arbitrarily.
Meaning that the results from the GridSearchCV will be invalid?
The results should be correct, just quite inefficient.
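A hedged aside (an assumption about usage, not something stated in this thread): if the concern is where the chunks end up, persisting the arrays before running the grid search keeps them in distributed memory on the workers, and client.has_what() makes the placement visible. A minimal sketch, assuming x, y, client, and grid_search are defined as in the benchmark above:

# Sketch only: persist the training data so its chunks stay resident on the
# workers, then inspect which worker is holding which pieces.
from pprint import pprint

x = x.persist()
y = y.persist()

pprint(client.has_what())   # maps each worker address to the keys it holds

grid_search.fit(x, y)       # later fits build on the already-placed chunks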