dask-ml
Train Test Split Unexpected Behavior
What happened:
`_validate_shuffle_split` is not setting the size of the test set correctly when only `train_size` is included. The reverse is also true: the train size will be wrong if you only set `test_size`.
I passed a dask array to `train_test_split` from dask-ml. The `_validate_shuffle_split` method in this function is actually imported from sklearn. (This may mean the issue is better posed to sklearn, but as it's affecting dask users you might be interested.)
The error you should see when running the reprex is: "ValueError: With n_samples=1, test_size=0.30000000000000004 and train_size=0.7, the resulting train set will be empty. Adjust any of the aforementioned parameters."
This error would be entirely accurate if it said `test_size=0.3`, but as written it's clearly not right. If you manually set both `train_size` and `test_size`, then the error is correct and the behavior is fine.
The smallest specified value that shows the issue, as far as I can tell, is 0.66.
Given the incredibly small decimal error on the split default, the error doesn't show itself on small data. At scale, however, it eventually makes the test set microscopically larger than it should be; it surfaced in my dask use case because the data was large enough for the test size to calculate out to one extra observation.
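To make the off-by-one concrete, here is a minimal sketch in plain Python. It mirrors the ceil-for-test / floor-for-train rounding that sklearn's `_validate_shuffle_split` applies (based on the linked sklearn source; the variable names here are illustrative, not dask-ml code):

```python
import math

n_samples = 10
train_size = 0.7

# When test_size is omitted, the complement is computed in floating point:
test_size = 1 - train_size  # 0.30000000000000004, not 0.3

# Test count is rounded up, train count is rounded down:
n_test = math.ceil(test_size * n_samples)     # ceil(3.0000000000000004) -> 4
n_train = math.floor(train_size * n_samples)  # floor(7.0) -> 7

# The two pieces now sum to more than the data itself:
print(n_train + n_test)  # 11, but n_samples is only 10
```

With an explicit `test_size=0.3`, `ceil(0.3 * 10)` is 3 and the counts sum to exactly 10, which matches the observation that passing both fractions makes the problem disappear.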
What you expected to happen:
The dask-ml version of `train_test_split` would default the test size to exactly 0.3, not 0.30000000000000004.
Minimal Complete Verifiable Example:
from dask_ml.model_selection import train_test_split
import dask.dataframe as dd
from sklearn import datasets
import pandas as pd
import numpy as np
# Grab sample dataset
iris = datasets.load_iris()
iris_pd = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                       columns=iris['feature_names'] + ['target'])
iris_pd['col1_madeup'] = iris_pd['sepal length (cm)'].astype(str)
iris_pd['col2_madeup'] = iris_pd['sepal width (cm)'].astype(str)
iris_dd = dd.from_pandas(iris_pd, npartitions=10)
def preprocess(dns: dd.DataFrame) -> dd.DataFrame:
    dns['target'] = 0
    dns = dns.groupby(["target", "col2_madeup"]).col1_madeup.apply(list)
    dns = dns.reset_index()
    return dns
iris_dd = preprocess(iris_dd)
iris_da = iris_dd.to_dask_array().compute_chunk_sizes()
x_train, x_test, y_train, y_test = train_test_split(
    iris_da[:, 1],
    iris_da[:, 2],
    train_size=0.7,
    # test_size=0.3,  # If you uncomment this, the sizes come out correct.
    shuffle=True,
)
Anything else we need to know?:
~It seems like the solution choices are to write your own version of `_validate_shuffle_split`, or I can go over to sklearn and investigate with them, to try and get a fix introduced to their version that dask-ml can just import.~
Never mind, it sounds like this is more likely a floating point arithmetic dilemma, so some arithmetic remedy like `float()` might be the right choice.
Environment: Reproduced on Macbook Pro locally and also on Saturn Cloud Jupyter Labs notebook
- Dask version: 2.27.0
- Python version: 3.8
- Operating System: Mac OS 10.15.6
- Install method (conda, pip, source): pip
Thanks for the report. I think the root cause is that `iris_da` has some blocks with very few samples.
In [28]: iris_da.chunks
Out[28]: ((2, 5, 1, 2, 4, 0, 2, 1, 5, 1), (3,))
`train_test_split` works blockwise, so we call `sklearn.model_selection.train_test_split` on each block, and for a block with 1 or 0 samples there's just no way to split it.
The best option might be to catch the exception and re-raise with a more informative error message, something like "Maybe you want to `.rechunk()` before calling `train_test_split`". Does that make sense?
That's not the issue; sorry if I was unclear. The issue is that `test_size` should not default to 0.30000000000000004 when it is not specified. The error is only shown here so you can see that the number being returned is not right.
Ah, gotcha. In that case, I think it just comes down to how Python implements floating point arithmetic:
In [3]: 1 - 0.7
Out[3]: 0.30000000000000004
I don't recall the specifics, but I believe neither 0.7 nor 0.3 can be represented exactly in binary, so when you subtract them you get this weird result.
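For reference, the standard library's `decimal.Decimal` can show the value a float actually stores, which makes the representability point visible:

```python
from decimal import Decimal

# Constructing a Decimal from a float exposes the exact stored double.
# Both 0.7 and 0.3 are stored as values slightly below their decimal forms:
print(Decimal(0.7))  # 0.69999999999999995559...
print(Decimal(0.3))  # 0.29999999999999998889...

# So the complement computed from 0.7 is not the same float as the literal 0.3:
print(1 - 0.7 == 0.3)  # False
```

That inequality is exactly why the defaulted `test_size` prints as 0.30000000000000004 in the error message.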
That's a problem, though: when I use dask, my data gets large enough that the rounding error makes my train and test set sizes add up to one more than the entire size of my array, and thus the split fails. I notice that sklearn has some ceiling and floor functions applied strategically; maybe that would be a remedy in this case too.
I'd also note that all values greater than 0.66 appear to trigger this error, and you're probably right that the float arithmetic is the problem, so a blanket solution of, say, a `floor()` wrapper might just solve it.
As requested, a pointer to the spot where sklearn handles this problem: https://github.com/scikit-learn/scikit-learn/blob/8ea176ae0ca535cdbfad7413322bbc3e54979e4d/sklearn/model_selection/_split.py#L1826-L1841
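One remedy along the lines discussed above would be to never take the complement in floating point at all: floor the train count, then take the test count as the integer remainder, so the two always sum to exactly `n_samples`. This is a hypothetical sketch (the helper name and approach are illustrative, not the actual sklearn or dask-ml fix):

```python
import math

def split_counts(n_samples: int, train_size: float) -> tuple:
    """Hypothetical helper: compute the test count as the integer
    remainder of the train count, avoiding `1 - train_size` in
    floating point entirely."""
    n_train = math.floor(train_size * n_samples)
    n_test = n_samples - n_train  # exact integer arithmetic
    return n_train, n_test

print(split_counts(10, 0.7))  # (7, 3) -- sums to exactly 10
```

By construction `n_train + n_test == n_samples` for any inputs, so the one-extra-observation failure mode described above cannot occur.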