
No support for stratified split in dask_ml.model_selection.train_test_split

Open chauhankaranraj opened this issue 6 years ago • 20 comments

The scikit-learn implementation of train/test split (sklearn.model_selection.train_test_split) supports splitting data according to class labels (a stratified split) via the stratify argument. This is especially useful when a dataset has high class imbalance. It would be really helpful to have this feature in dask_ml as well.
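For reference, this is how it looks in scikit-learn today, followed (commented out) by the kind of call that would be nice to have in dask-ml; the dask-ml lines are only the requested behavior, not something that currently works:

import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced labels: 90% class 0, 10% class 1
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

# scikit-learn preserves the 90/10 class ratio in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Requested for dask-ml (the stratify keyword is exactly what is missing):
# import dask.array as da
# from dask_ml.model_selection import train_test_split as dask_tts
# dX = da.from_array(X, chunks=200)
# dy = da.from_array(y, chunks=200)
# X_train, X_test, y_train, y_test = dask_tts(dX, dy, test_size=0.2, stratify=dy)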

chauhankaranraj avatar Aug 09 '19 20:08 chauhankaranraj

Agreed. Are you interested in working on this?

TomAugspurger avatar Aug 09 '19 20:08 TomAugspurger

Tempted to say yes, but I don't know the codebase/internals very well (specifically, I'm not sure how we can get a stratified split given that blockwise=False is not implemented for the ShuffleSplit class).

So it'd be faster if someone more knowledgeable could volunteer. If not then I'd be happy to give it a shot, but it might take some time.

chauhankaranraj avatar Aug 11 '19 23:08 chauhankaranraj

That's great if you're willing to try. Let us know if you get stuck.

TomAugspurger avatar Aug 12 '19 02:08 TomAugspurger

Hey Tom, I'm thinking of picking this up. My question is:

Say we have a big CSV file with two categories, read into two partitions: file_0 contains both categories 0 and 1, while file_1 contains only category 1.

My first thought was to just use the stratify parameter of scikit-learn, but in this case that wouldn't work. Another idea would be to compute all the categories beforehand and pass those to the stratify parameter, but that seems overly complicated and prone to a ton of edge cases.
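To make this concrete, here's a hypothetical sketch of that layout (column names and sizes are made up):

import pandas as pd
import dask.dataframe as dd

# Two-partition dataset: partition 0 holds categories 0 and 1,
# partition 1 holds only category 1.
part0 = pd.DataFrame({"x": range(6), "label": [0, 0, 0, 1, 1, 1]})
part1 = pd.DataFrame({"x": range(6, 10), "label": [1, 1, 1, 1]})
ddf = dd.concat([
    dd.from_pandas(part0, npartitions=1),
    dd.from_pandas(part1, npartitions=1),
])

# Per-partition label counts: partition 1 only ever sees one class, so any
# partition-local notion of "stratify" has no global class information to work with.
print(ddf.map_partitions(lambda df: df["label"].value_counts()).compute())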

I'd be glad to pick this up, as it would help in some research I'm doing.

tiagofassoni avatar Dec 13 '19 17:12 tiagofassoni

@tiagofassoni great! dask-ml's OneHotEncoder may be helpful here. It will use the Categorical dtype for pandas DataFrames. Otherwise you can (or may need to) pass the categories manually as a list / array. Does that make sense?
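Roughly something like this (the column and the category list here are just for illustration):

import pandas as pd
import dask.dataframe as dd
from dask_ml.preprocessing import OneHotEncoder

# Declaring the full category set up front means dask knows the categories
# without having to compute anything.
dtype = pd.CategoricalDtype(categories=[0, 1])
pdf = pd.DataFrame({"label": pd.Series([0, 0, 1, 1, 1], dtype=dtype)})
ddf = dd.from_pandas(pdf, npartitions=2)

# With known categories, the encoder can build the dummy columns lazily.
encoded = OneHotEncoder().fit_transform(ddf[["label"]])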

In other places that just work with arrays, like Incremental, we require that the classes (groups in this case) be specified ahead of time.
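Something like this, with the estimator and class list purely for illustration:

import dask.array as da
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier

X = da.random.random((1000, 5), chunks=200)
y = da.random.randint(0, 2, size=(1000,), chunks=200)

# The full set of classes is given up front and forwarded to partial_fit,
# so no pass over the data is needed before training starts.
clf = Incremental(SGDClassifier())
clf.fit(X, y, classes=[0, 1])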

TomAugspurger avatar Dec 13 '19 19:12 TomAugspurger

Is there any luck with this feature request? In the case of a huge imbalanced dataset, the stratify argument in train_test_split is useful.

jerrytim avatar Feb 21 '20 14:02 jerrytim

I’m not aware of any progress. Perhaps Tiago can share a status update.

TomAugspurger avatar Feb 21 '20 14:02 TomAugspurger

Hello, @TomAugspurger, @jerrytim. I got to try my hand at this just last week and... gotta say, I have no idea how to make it work. I don't see why OneHotEncoder would be helpful, if at all.

I was thinking of using something like pandas' value_counts on the label series and then doing a shuffle based on that, but I don't know whether such an approach is feasible.
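Something along these lines, just as a sketch (nothing here is hooked into train_test_split yet):

import pandas as pd
import dask.dataframe as dd

labels = dd.from_pandas(pd.Series([0] * 90 + [1] * 10, name="label"), npartitions=4)

# value_counts builds a lazy graph; nothing is computed until .compute()
counts = labels.value_counts()
frequencies = (counts / counts.sum()).compute()  # e.g. 0 -> 0.9, 1 -> 0.1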

tiagofassoni avatar Feb 26 '20 22:02 tiagofassoni

@TomAugspurger I agree with @tiagofassoni - I'm not sure how OneHotEncoder can be used. But I also don't understand how value_counts can be used - @tiagofassoni could you please elaborate?

There are two things I wanted to bring into the discussion that might help us decide how to implement this. IIUC, splitting is handled differently for da.Array and dd.Series/dd.DataFrame, correct?

  1. For dd.Series/dd.DataFrame, the heavy lifting is done by random_split, but I couldn't find its source code, so I'm not 100% sure how to deal with that case.
  2. For da.Array, the heavy lifting is done by ShuffleSplit and _blockwise_slice. Could we take the parts of the input array that belong to a particular class, compute the chunks of this subarray, apply the same ShuffleSplit + _blockwise_slice strategy to it, repeat for all classes, and finally concatenate the results? This would be roughly along the same lines as @tiagofassoni's comment:

Another idea would be to compute all the categories beforehand and pass those to the stratify parameter, but that seems overly complicated and prone to a ton of edge cases.

chauhankaranraj avatar Mar 08 '20 20:03 chauhankaranraj

random_split but I couldn't find its source code. So I'm not 100% sure how to deal with that case.

That's in dask.dataframe.DataFrame.random_split
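For example (fractions and seed picked arbitrarily):

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=4)

# random_split assigns each row to one of the outputs with the given
# probabilities; it never looks at a label column, which is why it can't
# stratify on its own.
train, test = ddf.random_split([0.8, 0.2], random_state=0)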

compute all the categories beforehand and pass those to the stratify parameter, but that seems overly complicated and prone to a ton of edge cases.

In these cases we typically require the user to provide the set of classes up front. But in this case, do we need to make a pass over the data to determine the frequency of each class? Or can that be done lazily?

TomAugspurger avatar Mar 09 '20 21:03 TomAugspurger

That's in dask.dataframe.DataFrame.random_split

Gotcha, thanks! I'll take a look :)

In these cases we typically require the user to provide the set of classes up front. But in this case, do we need to make a pass over the data to determine the frequency of each class? Or can that be done lazily?

Yeah, I agree - having the classes up front would be ideal. We could still compute the classes (da.unique on the stratify array), but I don't think that can be done lazily, and thus it wouldn't be ideal.

Maybe I'm missing something here, but do we really need the frequencies? This might be a little far from optimal, but could we do something along these lines:

# Snippet meant to live inside dask_ml.model_selection._split, so `itertools` and
# `dask.array as da` are the module-level imports, `arrays`, `classes`, `_stratify`
# and `splitter` (a ShuffleSplit instance) come from the caller, and
# `_blockwise_slice` is the existing helper in that module.
train_test_pairs = []
for arr in arrays:

    # create subarrays for each class, apply the split on each subarray individually
    arr_train_test_pairs = [[], []]
    for ci in classes:
        ci_arr = arr[_stratify == ci]
        ci_arr.compute_chunk_sizes()  # boolean masking leaves the chunk sizes unknown
        train_idx, test_idx = next(splitter.split(ci_arr))
        arr_train_test_pairs[0].append(_blockwise_slice(ci_arr, train_idx))
        arr_train_test_pairs[1].append(_blockwise_slice(ci_arr, test_idx))

    # concatenate all train subarrays into one train array, all test subarrays into one test array
    arr_train_test_pairs[0] = da.concatenate(arr_train_test_pairs[0])
    arr_train_test_pairs[1] = da.concatenate(arr_train_test_pairs[1])
    train_test_pairs.append(arr_train_test_pairs)

return list(itertools.chain.from_iterable(train_test_pairs))

chauhankaranraj avatar Mar 09 '20 22:03 chauhankaranraj

@chauhankaranraj does that split the class subarrays evenly? So it's a stratified train/test with 0.5 test and 0.5 train?

Note: I'm a data scientist, not a developer...

scikit-learn uses np.bincount in StratifiedShuffleSplit (in sklearn.model_selection._split) to get the class frequencies and split accordingly.
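For example, with made-up labels:

import numpy as np

y = np.array([0, 0, 0, 0, 1, 0, 1, 0, 0, 1])

# np.bincount gives the count of each class label; StratifiedShuffleSplit uses
# these counts to size the per-class train/test allocations.
class_counts = np.bincount(y)          # array([7, 3])
class_freqs = class_counts / y.size    # array([0.7, 0.3])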

trail-coffee avatar Mar 16 '20 15:03 trail-coffee

@chauhankaranraj does that split the class subarrays evenly? So it's a stratified train/test with 0.5 test and 0.5 train?

@ericbassett It should split it into whatever train/test ratio is provided as input. The splitter used here is the instance of ShuffleSplit that gets created here. IIUC, it takes care of splitting by the ratios provided.

I'll submit a WIP PR soon so this discussion becomes more concrete :)

chauhankaranraj avatar Mar 16 '20 15:03 chauhankaranraj

Very nice, makes sense.

trail-coffee avatar Mar 16 '20 15:03 trail-coffee

Hey folks,

I made an attempt to implement the stratified split here. I could do it lazily for dask Series and DataFrames, but not completely lazily for dask Array (calling compute_chunk_sizes()).

Does anyone have ideas to get around this? Would it be possible to "enforce" the chunk sizes instead of computing them? [e.g. if the chunk size for the whole array is (x, 10), then the chunk size for the part of the array that belongs to a class with weight 15% should be (0.15x, 10)]
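For context, this is the behaviour I'm referring to (the arrays here are made up):

import dask.array as da

x = da.random.random((100, 10), chunks=(25, 10))
labels = da.random.randint(0, 2, size=(100,), chunks=25)

subset = x[labels == 0]
print(subset.chunks)  # row chunk sizes are nan, i.e. unknown, until computed

subset = subset.compute_chunk_sizes()  # forces a real pass over the data
print(subset.chunks)  # concrete sizes now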

Any feedback in general would be highly appreciated :pray:

Also, if you feel this discussion should be moved to a WIP PR, I can open that too.

chauhankaranraj avatar Mar 29 '20 18:03 chauhankaranraj

It may be easiest to move to a PR. We might be able to do things lazily for dask array; we'll just probably end up with unknown chunk sizes.

TomAugspurger avatar Mar 30 '20 13:03 TomAugspurger

@TomAugspurger Sure thing. Opened this WIP PR yesterday

chauhankaranraj avatar Apr 04 '20 21:04 chauhankaranraj

Any progress on this task? :)

ashokrayal avatar Jun 07 '22 07:06 ashokrayal

I need the stratify feature in train_test_split as well for my imbalanced dataset. Any updates?

kennylids avatar Jul 28 '22 03:07 kennylids

Hey folks, sorry but I haven't had the chance to continue working on this. I did open a WIP PR (#635) so if anyone would like to fork off of it or just start from scratch, feel free to do so! Let me know if you'd like anything from me in doing so.

chauhankaranraj avatar Jul 31 '22 19:07 chauhankaranraj