KFold cross validation fails with dask dataframes
Describe the issue:
`KFold.split` doesn't support dask dataframes. With the recent Dask integrations in e.g. XGBoost and Optuna, it would be very useful if it did. The error message acknowledges that dataframes are not supported and suggests converting them to dask arrays. With modern ML workflows this isn't ideal, since datasets commonly contain columns of many types (float, int, bool, categorical), which a single-dtype array cannot represent cleanly.
Minimal Complete Verifiable Example:

```python
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split, KFold

ddf = dd.demo.make_timeseries()

train, test = train_test_split(ddf)  # works

k_folder = KFold(n_splits=5)
for train, test in k_folder.split(ddf):  # fails
    pass
```
Traceback:

```pytb
TypeError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 for train, test in k_folder.split(ddf):
      2     pass

File ~/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/model_selection/_split.py:241, in KFold.split(self, X, y, groups)
    240 def split(self, X, y=None, groups=None):
--> 241     X = check_array(X)
    242     n_samples = X.shape[0]
    243     n_splits = self.n_splits

File ~/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/utils.py:197, in check_array(array, accept_dask_array, accept_dask_dataframe, accept_unknown_chunks, accept_multiple_blocks, preserve_pandas_dataframe, remove_zero_chunks, *args, **kwargs)
    195 elif isinstance(array, dd.DataFrame):
    196     if not accept_dask_dataframe:
--> 197         raise TypeError(
    198             "This estimator does not support dask dataframes. "
    199             "This might be resolved with one of\n\n"
    200             "  1. ddf.to_dask_array(lengths=True)\n"
    201             "  2. ddf.to_dask_array()  # may cause other issues because "
    202             "of unknown chunk sizes"
    203         )
    204     # TODO: sample?
    205     return array

TypeError: This estimator does not support dask dataframes. This might be resolved with one of

  1. ddf.to_dask_array(lengths=True)
  2. ddf.to_dask_array()  # may cause other issues because of unknown chunk sizes
```
Anything else we need to know?:
We recently worked around this limitation with the following (concatenating the remaining pieces into a single training frame):

```python
import dask.dataframe as dd

def _make_cv(df, num_folds):
    frac = [1 / num_folds] * num_folds
    splits = df.random_split(frac, shuffle=True)
    for i in range(num_folds):
        # All pieces except the i-th form the training set
        train = dd.concat([splits[j] for j in range(num_folds) if j != i])
        test = splits[i]
        yield train, test

for i, (train, test) in enumerate(_make_cv(ddf, num_folds=5)):
    pass
```
Thanks @phobson. I agree it would be nice if Dask DataFrames were supported here (this would also match scikit-learn's behavior).
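For comparison, scikit-learn's `KFold` accepts a pandas DataFrame directly and yields positional index arrays (the toy frame below is illustrative, not from the issue):

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({"x": range(10), "flag": [True, False] * 5})

# Each split yields arrays of row positions, usable with .iloc
for train_idx, test_idx in KFold(n_splits=5).split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]

print(len(train), len(test))  # 8 2
```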
cc @mmccarty for visibility in case you, or folks around you, have bandwidth to look into this
Thanks @jrbourbeau and @phobson. I'll take a look.