KFold cross validation fails with dask dataframes
Describe the issue:
`KFold.split` doesn't support dask dataframes. With the recent Dask integrations in e.g. XGBoost and Optuna, it would be very useful if it did. The error message acknowledges that dataframes are not supported and suggests converting them to dask arrays. With modern ML workflows this isn't ideal, since datasets commonly contain columns of many types (float, int, bool, categorical), which a single-dtype array cannot represent cleanly.
Minimal Complete Verifiable Example:

```python
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split, KFold

ddf = dd.demo.make_timeseries()

train, test = train_test_split(ddf)  # works

k_folder = KFold(n_splits=5)
for train, test in k_folder.split(ddf):  # fails
    pass
```
Traceback:

```pytb
TypeError                                 Traceback (most recent call last)
Cell In[12], line 1
----> 1 for train, test in k_folder.split(ddf):
      2     pass

File ~/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/model_selection/_split.py:241, in KFold.split(self, X, y, groups)
    240 def split(self, X, y=None, groups=None):
--> 241     X = check_array(X)
    242     n_samples = X.shape[0]
    243     n_splits = self.n_splits

File ~/mambaforge/envs/ml-example/lib/python3.10/site-packages/dask_ml/utils.py:197, in check_array(array, accept_dask_array, accept_dask_dataframe, accept_unknown_chunks, accept_multiple_blocks, preserve_pandas_dataframe, remove_zero_chunks, *args, **kwargs)
    195 elif isinstance(array, dd.DataFrame):
    196     if not accept_dask_dataframe:
--> 197         raise TypeError(
    198             "This estimator does not support dask dataframes. "
    199             "This might be resolved with one of\n\n"
    200             "  1. ddf.to_dask_array(lengths=True)\n"
    201             "  2. ddf.to_dask_array()  # may cause other issues because "
    202             "of unknown chunk sizes"
    203         )
    204     # TODO: sample?
    205     return array

TypeError: This estimator does not support dask dataframes. This might be resolved with one of

  1. ddf.to_dask_array(lengths=True)
  2. ddf.to_dask_array()  # may cause other issues because of unknown chunk sizes
```
Anything else we need to know?:
We recently worked around this limitation with the following (concatenating the remaining pieces into a single training frame):

```python
import dask.dataframe as dd

def _make_cv(df, num_folds):
    frac = [1 / num_folds] * num_folds
    splits = df.random_split(frac, shuffle=True)
    for i in range(num_folds):
        # All pieces except the i-th form the training set
        train = dd.concat([splits[j] for j in range(num_folds) if j != i])
        test = splits[i]
        yield train, test

for i, (train, test) in enumerate(_make_cv(ddf, num_folds=5)):
    pass
```
Thanks @phobson. I agree it would be nice if Dask DataFrames were supported here (this would also match scikit-learn's behavior).
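For comparison, scikit-learn's `KFold` accepts a pandas DataFrame directly and yields positional index arrays (the toy frame below is illustrative, not from the issue):

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({"x": range(10), "flag": [True, False] * 5})

# Each split yields arrays of row positions, usable with .iloc
for train_idx, test_idx in KFold(n_splits=5).split(df):
    train, test = df.iloc[train_idx], df.iloc[test_idx]

print(len(train), len(test))  # 8 2
```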
cc @mmccarty for visibility in case you, or folks around you, have bandwidth to look into this
Thanks @jrbourbeau and @phobson. I'll take a look.