dask-ml: for a single-record DataFrame, train_test_split() sometimes assigns the only record to the test set.

Describe the issue:

Disclaimer: I know the bug looks silly but I still wanted to give a heads up.

For a DataFrame with only one record, train_test_split() sometimes returns an empty train set and a test set containing that one record. Is that the desired behavior?
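One plausible reason for the "sometimes" is sketched below. It assumes the split assigns each row to train or test independently at random, which is an assumption about the implementation and not something confirmed in this thread. The helper names `per_row_split` and `empty_train_rate` are hypothetical and use only the standard library, not dask.

```python
import random


def per_row_split(n_rows, test_size, rng):
    """Assign each row index to test with probability test_size, else to train.

    This models a per-row random split (an assumption), not dask-ml's actual code.
    """
    train, test = [], []
    for i in range(n_rows):
        (test if rng.random() < test_size else train).append(i)
    return train, test


def empty_train_rate(n_rows=1, test_size=0.3, trials=10_000, seed=0):
    """Estimate how often the train set comes back empty under that model."""
    rng = random.Random(seed)
    empty = sum(
        1 for _ in range(trials) if not per_row_split(n_rows, test_size, rng)[0]
    )
    return empty / trials


if __name__ == '__main__':
    # With one row, the estimate should be close to test_size.
    print(empty_train_rate())
```

Under this model a single row lands in the test set with probability test_size, so over 20 attempts with test_size=0.3 an empty train set is very likely to appear at least once (1 - 0.7**20 is about 0.999).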

Minimal Complete Verifiable Example:

import pandas as pd
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split


if __name__ == '__main__':

    for _ in range(20):

        df = pd.DataFrame({'x0': [0], 'x1': [1], 'y': [2]})

        ddf = dd.from_pandas(df, npartitions=1)
        x = ddf[['x0', 'x1']]
        y = ddf['y']

        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

        if x_train.shape[0].compute() == 0:
            print('x_train is empty!')
            break

Anything else we need to know?:

Nope

Environment:

Dask version: 2023.5.0
Dask ML version: 2023.3.24
Python version: 3.8.15
Operating System: Ubuntu 22.04
Install method (conda, pip, source): pip

Jun 29 '23 16:06 KWiecko

What's the behavior of scikit-learn here? We should match that, unless there's some reason not to.

One thing to note: we can't check the length of the DataFrame / array during graph construction. So if scikit-learn does any kind of length check, then we won't be able to (easily) match that behavior.

Jul 02 '23 12:07 TomAugspurger

The following code (which should be equivalent to the dask code above):

import pandas as pd
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    for _ in range(20):

        df = pd.DataFrame({'x0': [0], 'x1': [1], 'y': [2]})

        x = df[['x0', 'x1']]
        y = df['y']

        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
        # the line below raises the same error as the line above
        # x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)

        # pandas shapes are plain ints, so no .compute() is needed here
        if x_train.shape[0] == 0:
            print('x_train is empty!')
            break

throws the following error:

Traceback (most recent call last):
  File "/home/kw/Projects/upwork/gym/src/debug/fail_during_conversion.py", line 33, in <module>
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
  File "/home/kw/Projects/venvs/gym-test-venv/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2562, in train_test_split
    n_train, n_test = _validate_shuffle_split(
  File "/home/kw/Projects/venvs/gym-test-venv/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2236, in _validate_shuffle_split
    raise ValueError(
ValueError: With n_samples=1, test_size=None and train_size=0.7, the resulting train set will be empty. Adjust any of the aforementioned parameters.

So it looks like the default behavior in this case is to raise?
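The error message is consistent with simple size arithmetic. The sketch below is a simplified model of that arithmetic for float sizes only, assuming the test count is rounded up and the train count is rounded down. `split_sizes` is a hypothetical helper, not scikit-learn's actual function.

```python
import math


def split_sizes(n_samples, test_size=None, train_size=None):
    """Return (n_train, n_test) for float sizes, raising if train would be empty.

    Simplified model (assumption): n_test = ceil(test_size * n_samples),
    n_train = floor(train_size * n_samples), and a missing size is the complement.
    """
    if test_size is None and train_size is None:
        raise ValueError('pass test_size or train_size in this sketch')

    n_test = math.ceil(test_size * n_samples) if test_size is not None else None
    n_train = math.floor(train_size * n_samples) if train_size is not None else None

    if n_train is None:
        n_train = n_samples - n_test
    elif n_test is None:
        n_test = n_samples - n_train

    if n_train == 0:
        raise ValueError(
            f'With n_samples={n_samples}, test_size={test_size} and '
            f'train_size={train_size}, the resulting train set will be empty.'
        )
    return n_train, n_test


if __name__ == '__main__':
    print(split_sizes(4, test_size=0.25))
    try:
        split_sizes(1, test_size=0.3)
    except ValueError as exc:
        print(exc)
```

With one sample, test_size=0.3 rounds up to one test row and train_size=0.7 rounds down to zero train rows, so both calls from the snippet above end up with an empty train set.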

Jul 02 '23 12:07 KWiecko

Hey, can I work on this issue?

Oct 13 '24 06:10 sameeksha-sunilkumar

Sure, thanks.

Oct 15 '24 12:10 TomAugspurger