Enable custom samplers for imbalanced datasets

Open PierreQuinton opened this issue 2 years ago • 3 comments

🚀 The feature

For each classification dataset with a balanced class distribution (MNIST, CIFAR-N, etc.), it would be very useful to provide a standard imbalanced version of the dataset. For a dataset with $n$ classes, define the imbalance factor $a\in [0,1]$; the proportion of class $i$ is then typically proportional to $a^{i/(n-1)}$, normalized so that the proportions sum to $1$. For $a=1$ the distribution is uniform, and the smaller the imbalance factor, the more imbalanced the dataset is.
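To make the formula concrete, here is a minimal sketch in plain Python. It assumes classes are indexed $i = 0, \dots, n-1$, so that class $0$ is the most frequent and class $n-1$ is $a$ times as frequent. The function names and the `max_per_class` parameter are hypothetical, chosen only for illustration; they are not part of torchvision.

```python
def imbalanced_class_proportions(num_classes: int, imbalance_factor: float) -> list[float]:
    """Return class proportions p_i proportional to a^(i/(n-1)), normalized to sum to 1.

    Assumption: classes are indexed i = 0, ..., n-1.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    if not 0.0 <= imbalance_factor <= 1.0:
        raise ValueError("imbalance_factor must lie in [0, 1]")
    # Unnormalized weight of class i: a^(i/(n-1)), from 1 (i = 0) down to a (i = n-1).
    weights = [imbalance_factor ** (i / (num_classes - 1)) for i in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def imbalanced_class_counts(num_classes: int, imbalance_factor: float, max_per_class: int) -> list[int]:
    """Hypothetical helper: number of samples to keep per class when the most
    frequent class (i = 0) keeps max_per_class samples."""
    return [
        round(max_per_class * imbalance_factor ** (i / (num_classes - 1)))
        for i in range(num_classes)
    ]
```

For example, `imbalanced_class_counts(10, 0.01, 5000)` would describe a 10-class dataset in which the rarest class keeps roughly 1% as many samples as the most frequent one.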

I am not sure whether torchvision should provide the datasets themselves or a data loader that imbalances an existing dataset.

Motivation, pitch

Many papers have been published on the problem of training on an imbalanced dataset and testing on a balanced one, for instance see this. As far as I know, there is no systematic way of generating such datasets for people using PyTorch. Here are a few very similar implementations that are not fully satisfying:

  • https://github.com/zhangyongshun/BagofTricks-LT/blob/main/lib/dataset/cao_cifar.py
  • https://github.com/KaihuaTang/Long-Tailed-Recognition.pytorch/blob/master/classification/data/ImbalanceCIFAR.py

Such datasets seem to exist for TensorFlow; for instance, section 3 of the readme of this repo provides links to download tfrecord datasets.

I feel like it could be a very nice feature for torchvision to either contain such datasets or make it easy to craft them.

Alternatives

No response

Additional context

No response

cc @pmeier

PierreQuinton avatar Nov 05 '23 14:11 PierreQuinton

Hi @PierreQuinton ,

It seems like what you need is a custom Sampler. IIUC, https://github.com/ufoym/imbalanced-dataset-sampler should be pretty close to what you're looking for?

NicolasHug avatar Nov 09 '23 17:11 NicolasHug

@NicolasHug Thanks for your answer, yes this is exactly what I am looking for. I'm not sure if you would like to add something similar to torch or if you would close the issue, I leave it up to you.

PierreQuinton avatar Nov 14 '23 07:11 PierreQuinton

Thanks @PierreQuinton . I'll keep the issue open and rename it for clarity. Ultimately, what is needed to enable that is:

    1. a consistent interface across datasets to access the idx -> classes mapping
    2. a custom sampler (probably need a distributed one as well)

Item 1 is definitely in scope for torchvision and this is something we'd be doing if we ever re-start our work on a dataset revamp (CC @pmeier ). For item 2, we can decide when the time comes, but I don't see why not.
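As a rough illustration of how the two pieces fit together, the sketch below takes an idx -> class mapping as a plain list (`targets`, assumed to be available from the dataset) and yields indices so that class frequencies follow the $a^{i/(n-1)}$ profile from the original request. The class name `ImbalancedIndexSampler` and its parameters are hypothetical; this is not an existing torchvision or PyTorch API. It samples with replacement, is written in plain Python, and does not cover the distributed case.

```python
import random
from collections import Counter


class ImbalancedIndexSampler:
    """Hypothetical sampler: draws dataset indices (with replacement) so that the
    class ranked i (0-based, in sorted label order) appears with probability
    proportional to a^(i/(n-1))."""

    def __init__(self, targets, imbalance_factor, num_samples=None, seed=None):
        self.targets = list(targets)  # idx -> class mapping
        classes = sorted(set(self.targets))
        n = len(classes)
        if n < 2:
            raise ValueError("need at least two classes")
        if not 0.0 <= imbalance_factor <= 1.0:
            raise ValueError("imbalance_factor must lie in [0, 1]")
        # Target (unnormalized) probability mass of each class.
        class_weight = {c: imbalance_factor ** (rank / (n - 1)) for rank, c in enumerate(classes)}
        class_count = Counter(self.targets)
        # Spread each class's mass evenly over its samples, so the result does
        # not depend on how many samples each class has in the source dataset.
        self.weights = [class_weight[t] / class_count[t] for t in self.targets]
        self.num_samples = len(self.targets) if num_samples is None else num_samples
        self.rng = random.Random(seed)

    def __iter__(self):
        indices = range(len(self.targets))
        return iter(self.rng.choices(indices, weights=self.weights, k=self.num_samples))

    def __len__(self):
        return self.num_samples
```

Since the object only needs to be iterable over indices and have a length, it should be usable as the `sampler` argument of a `DataLoader`, though subclassing `torch.utils.data.Sampler` would be the more conventional choice. An alternative design is to subsample the dataset once (without replacement) and keep a fixed imbalanced subset, which is closer to what the implementations linked in the issue appear to do.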

NicolasHug avatar Nov 14 '23 11:11 NicolasHug