dask-ml
WIP: Add stratified split feature to model_selection.train_test_split
I took a stab at implementing a solution for issue #535.
Adding a WIP label because the stratified split is currently not completely lazy for dask arrays (compute_chunk_sizes is called here). Nonetheless, I think it works fine for dask series and dataframes.
Any feedback would be appreciated :)
My two cents.
- classes can be optional, because computing the classes from an out-of-core dataset outside of train_test_split will cost the same.
- If we split class by class, does that mean the returned train and test datasets are ordered by class?
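A minimal sketch of the first point, in plain Python rather than dask. The helper `resolve_classes` is a hypothetical name, not part of dask-ml or of this PR: it uses the classes when the caller supplies them and otherwise derives them with one full pass over the labels, which is the same pass the caller would have had to make beforehand.

```python
from typing import Hashable, Iterable, List, Optional


def resolve_classes(
    labels: Iterable[Hashable],
    classes: Optional[Iterable[Hashable]] = None,
) -> List[Hashable]:
    """Return the class labels to stratify on (hypothetical helper)."""
    if classes is not None:
        # The caller already knows the classes: no pass over the data.
        return list(classes)
    # Otherwise make one full pass over the labels. dict.fromkeys drops
    # duplicates and keeps the order in which classes are first seen.
    return list(dict.fromkeys(labels))


labels = ["setosa", "setosa", "versicolor", "virginica", "versicolor"]
inferred = resolve_classes(labels)
supplied = resolve_classes(labels, classes=["virginica", "setosa", "versicolor"])
```

On a real out-of-core dataset the pass over `labels` would be a distributed computation, but the cost argument is the same: it is paid once, either inside or outside the split.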
I don't think we would want to order by class. Does scikit-learn do that?
(In reply to austinzh's comment of Thu, Jun 18, 2020, 9:58 AM: https://github.com/dask/dask-ml/pull/635#issuecomment-646075107)
Yes. But if we look at the `for ci in classes:` loop, we find that the data is split class by class and then concatenated back together.
That implies the returned array, for example the train set, looks like
[randomized_classA, randomized_classB, randomized_classC], meaning that in this PR's implementation, samples of the same class stick together.
But if we use the same parameters with scikit-learn's train_test_split, the output is shuffled.
For example, I ran this on the unshuffled iris.csv.
output1 is the output of sklearn's `train, test = ms.train_test_split(df, test_size=0.2, random_state=0, shuffle=True, stratify=df['species'])`.
output2 is the output of this PR. I only print the species column.
output1:
setosa
setosa
setosa
setosa
versicolor
setosa
virginica
virginica
versicolor
virginica
virginica
versicolor
setosa
versicolor
virginica
virginica
setosa
versicolor
versicolor
setosa
virginica
setosa
setosa
virginica
virginica
versicolor
versicolor
setosa
virginica
virginica
versicolor
versicolor
setosa
virginica
virginica
versicolor
virginica
versicolor
virginica
versicolor
versicolor
versicolor
setosa
setosa
versicolor
versicolor
virginica
virginica
versicolor
setosa
virginica
virginica
setosa
setosa
versicolor
versicolor
setosa
setosa
versicolor
virginica
setosa
setosa
versicolor
versicolor
virginica
versicolor
virginica
setosa
setosa
virginica
versicolor
versicolor
setosa
setosa
virginica
versicolor
virginica
setosa
versicolor
virginica
virginica
versicolor
virginica
setosa
versicolor
setosa
setosa
virginica
virginica
versicolor
virginica
setosa
setosa
setosa
setosa
setosa
versicolor
versicolor
versicolor
virginica
setosa
virginica
setosa
virginica
setosa
versicolor
versicolor
versicolor
versicolor
setosa
virginica
virginica
setosa
versicolor
versicolor
virginica
setosa
virginica
virginica
virginica
output2:
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
setosa
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
versicolor
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
virginica
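The ordering difference between the two outputs can be reproduced in a small pure-Python sketch. This is neither the PR's code nor scikit-learn's algorithm; `stratified_train_indices` and its parameters are hypothetical names chosen for illustration. It picks training indices class by class and concatenates them, then optionally applies one final shuffle so that the classes end up interleaved.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def stratified_train_indices(
    labels: Sequence[Hashable],
    train_fraction: float,
    seed: int = 0,
    final_shuffle: bool = True,
) -> List[int]:
    """Pick train indices per class, then optionally shuffle the result."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    # Sample within each class and concatenate class after class.
    train: List[int] = []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_train = round(len(idx) * train_fraction)
        train.extend(idx[:n_train])

    # Without this step the result stays grouped by class.
    if final_shuffle:
        rng.shuffle(train)
    return train


# Unshuffled, iris-like labels: 50 rows per class (illustrative data).
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
grouped = [labels[i] for i in stratified_train_indices(labels, 0.8, final_shuffle=False)]
mixed = [labels[i] for i in stratified_train_indices(labels, 0.8, final_shuffle=True)]
```

With `final_shuffle=False` the labels come out grouped by class, as in output2; with `final_shuffle=True` the class proportions are unchanged but the classes are interleaved, as in output1.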