
DOC - shuffle the toxicity dataset in its example

Open jeromedockes opened this issue 10 months ago • 5 comments

Describe the issue linked to the documentation

The dataset in this example is sorted by label: the first 500 tweets are toxic and the rest are non-toxic.

The example does not explicitly shuffle it, but it still works because cross_validate uses a stratified k-fold by default for classification. However, that is not immediately obvious, and I think there has been some discussion recently in scikit-learn that it might be better not to stratify by default. So I think it would be good to actually shuffle the dataset.

Not sure where the best place to do it would be -- in the hosted zip file, in the fetcher, in the example, or by passing a splitter with shuffle=True as the cv argument of cross_validate
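As a rough sketch of the last option, shuffling can be requested through the splitter passed as `cv`. The data below is synthetic (`make_classification`, then sorted by label to mimic the issue), and `LogisticRegression` is only a placeholder estimator, not the pipeline used in the skrub example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic data, sorted so that all positive rows come first (assumption:
# this mimics the label-sorted layout described in the issue).
X, y = make_classification(n_samples=1000, random_state=0)
order = np.argsort(-y, kind="stable")
X, y = X[order], y[order]

# The splitter does the shuffling; a fixed random_state keeps it reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Check which classes end up in each test fold after shuffling.
fold_classes = [set(y[test].tolist()) for _, test in cv.split(X)]
print(results["test_score"])
print(fold_classes)
```

This keeps the stored dataset untouched and makes the shuffling visible in the example code. `StratifiedKFold(shuffle=True)` could be swapped in if the class ratio should also be preserved.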

WDYT @Vincent-Maladiere

Suggest a potential alternative/fix

No response

jeromedockes avatar Feb 07 '25 09:02 jeromedockes

In this case, I would rather shuffle in the examples: this problem is frequent in real applications, and I would like 1) people to see it and 2) our code to remain elegant when it has to shuffle.

GaelVaroquaux avatar Feb 07 '25 09:02 GaelVaroquaux

Real-world applications and datasets have many issues and aren't necessarily shuffled by default. If we want skrub to reflect not only toy examples like scikit-learn does, then I believe it's best to use the stratify parameter explicitly instead of having a perfect dataset.

Vincent-Maladiere avatar Feb 08 '25 12:02 Vincent-Maladiere

I think this issue can be split into two parts:

  1. A simple PR that modifies the example to show that there is a need to shuffle
  2. Possibly, a second PR that adds a "shuffle" parameter when fetching the dataset

rcap107 avatar Oct 10 '25 11:10 rcap107

I will work on this issue

LechlechLatifa avatar Oct 29 '25 10:10 LechlechLatifa

As discussed IRL, to address this issue we can't shuffle the dataset as part of the example: the dataset should be shuffled before loading it.

This means that the current version of the dataset should be downloaded, shuffled manually, and then re-uploaded to the dataset repo, i.e., it has to be done by a maintainer.
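The manual shuffling step could look roughly like the sketch below. It works on a small in-memory table rather than the real file, and the column names `text` and `is_toxic` are assumptions for illustration; downloading and re-uploading are left out.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded dataset: rows sorted by label.
df = pd.DataFrame(
    {
        "text": [f"tweet {i}" for i in range(1000)],
        "is_toxic": [1] * 500 + [0] * 500,
    }
)

# Shuffle all rows once with a fixed seed so the result is reproducible,
# then drop the old index so row order is the only thing that changes.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled.head())
```

Fixing the seed means the shuffled file can be regenerated identically later if the upload ever needs to be repeated.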

rcap107 avatar Nov 05 '25 17:11 rcap107