For a single-record DataFrame, train_test_split() sometimes assigns the only record to the test set.
Describe the issue:
Disclaimer: I know the bug looks silly, but I still wanted to give a heads-up.
For a DataFrame with only one record, train_test_split() sometimes returns an empty train set and a test set containing the one record. Is that the desired behavior?
Minimal Complete Verifiable Example:
import pandas as pd
import dask.dataframe as dd
from dask_ml.model_selection import train_test_split

if __name__ == '__main__':
    for _ in range(20):
        df = pd.DataFrame({'x0': [0], 'x1': [1], 'y': [2]})
        ddf = dd.from_pandas(df, npartitions=1)
        x = ddf[['x0', 'x1']]
        y = ddf['y']
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
        if x_train.shape[0].compute() == 0:
            print('x_train is empty!')
            break
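One possible reading of the "sometimes" above, as a rough sketch only: if the Dask split assigned each row to train or test independently at random (an assumption about the implementation, not something confirmed in this thread), a single row would land in the test set with probability roughly equal to test_size. The helper name `rowwise_random_split` is hypothetical and uses plain Python lists rather than Dask.

```python
import random


def rowwise_random_split(rows, test_size, rng):
    """Hypothetical helper: send each row to test with probability test_size."""
    train, test = [], []
    for row in rows:
        if rng.random() < test_size:
            test.append(row)
        else:
            train.append(row)
    return train, test


rng = random.Random(0)  # fixed seed so the simulation is repeatable
single_record = [{'x0': 0, 'x1': 1, 'y': 2}]
trials = 10_000

# Count how often the only record ends up in the test set,
# leaving the train set empty.
empty_train = 0
for _ in range(trials):
    train, test = rowwise_random_split(single_record, 0.3, rng)
    if not train:
        empty_train += 1

empty_fraction = empty_train / trials
print(f'train set empty in {empty_fraction:.1%} of trials')
```

Under that assumption, a loop of 20 attempts like the one above would very likely hit an empty train set at least once.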
Anything else we need to know?:
Nope
Environment:
- Dask version: 2023.5.0
- Dask ML version: 2023.3.24
- Python version: 3.8.15
- Operating System: Ubuntu 22.04
- Install method (conda, pip, source): pip
What's the behavior of scikit-learn here? We should match that, unless there's some reason not to.
One thing to note: we can't check the length of the DataFrame / array during graph construction. So if scikit-learn does any kind of length check, then we won't be able to (easily) match that behavior.
The following code (which should be equivalent to the dask code above):
import pandas as pd
from sklearn.model_selection import train_test_split

if __name__ == '__main__':
    for _ in range(20):
        df = pd.DataFrame({'x0': [0], 'x1': [1], 'y': [2]})
        x = df[['x0', 'x1']]
        y = df['y']
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
        # the line below raises the same error as the line above
        # x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
        # pandas shape[0] is a plain int, so no .compute() is needed here
        if x_train.shape[0] == 0:
            print('x_train is empty!')
            break
raises the following error:
Traceback (most recent call last):
File "/home/kw/Projects/upwork/gym/src/debug/fail_during_conversion.py", line 33, in <module>
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7)
File "/home/kw/Projects/venvs/gym-test-venv/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2562, in train_test_split
n_train, n_test = _validate_shuffle_split(
File "/home/kw/Projects/venvs/gym-test-venv/lib/python3.8/site-packages/sklearn/model_selection/_split.py", line 2236, in _validate_shuffle_split
raise ValueError(
ValueError: With n_samples=1, test_size=None and train_size=0.7, the resulting train set will be empty. Adjust any of the aforementioned parameters.
So it looks like scikit-learn's default behavior in this case is to raise?
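To make the arithmetic behind that error concrete, here is a simplified sketch of the size check, assuming float fractions only: the test count is rounded up and the train count is rounded down, so a single sample always leaves the train set empty. The function name `validate_split_sizes` is hypothetical; this is not scikit-learn's actual `_validate_shuffle_split`, which also handles integer sizes and other validation.

```python
from math import ceil, floor


def validate_split_sizes(n_samples, test_size=None, train_size=None):
    """Hypothetical, simplified size check for float fractions only."""
    if test_size is None and train_size is None:
        test_size = 0.25  # assumed default test fraction

    if test_size is not None:
        n_test = ceil(test_size * n_samples)    # test count rounds up
    if train_size is not None:
        n_train = floor(train_size * n_samples)  # train count rounds down

    # Whichever size was not given is the remainder of the other.
    if train_size is None:
        n_train = n_samples - n_test
    elif test_size is None:
        n_test = n_samples - n_train

    if n_train == 0:
        raise ValueError(
            f'With n_samples={n_samples}, test_size={test_size} and '
            f'train_size={train_size}, the resulting train set will be empty.'
        )
    return n_train, n_test


# Both calls from the example above hit the empty-train case.
for kwargs in ({'test_size': 0.3}, {'train_size': 0.7}):
    try:
        validate_split_sizes(1, **kwargs)
    except ValueError as err:
        print(err)
```

A check like this needs the number of samples up front, which is exactly what is not available while a Dask graph is being constructed.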
Hey, can I work on this issue?
Sure, thanks.