[interleave_datasets] sample batches from a single source at a time

Open memray opened this issue 6 months ago • 0 comments

Feature request

interleave_datasets and RandomlyCyclingMultiSourcesExamplesIterable let us sample individual examples from different sources. Can we also sample batches in a similar manner, so that each batch contains data from only a single source?

Motivation

Some recent research [1, 2] shows that source-homogeneous batching can be helpful for contrastive learning. Can we add an iterable called RandomlyCyclingMultiSourcesBatchesIterable to support this functionality?
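
To make the request concrete, here is a minimal pure-Python sketch of the behavior I have in mind. It is not based on the library's internals: the function name `iter_single_source_batches` and its parameters are hypothetical, and it assumes sampling continues until every source is exhausted (unlike the default `first_exhausted` stopping strategy of interleave_datasets), with a possibly shorter final batch per source.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Optional, Sequence, Tuple


def iter_single_source_batches(
    sources: Sequence[Iterable[dict]],
    batch_size: int,
    probabilities: Optional[Sequence[float]] = None,
    seed: int = 0,
) -> Iterator[Tuple[int, List[dict]]]:
    """Yield (source_index, batch) pairs where each batch comes from one source.

    Hypothetical helper for illustration only; not part of the datasets library.
    """
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    # Assumption: uniform sampling over sources when no probabilities are given.
    weights = list(probabilities) if probabilities is not None else [1.0] * len(sources)
    active = list(range(len(sources)))  # indices of sources that still have data

    while active:
        # Pick one source for the whole batch, weighted among the active ones.
        idx = rng.choices(active, weights=[weights[i] for i in active], k=1)[0]
        # Pull up to batch_size examples from that source only.
        batch = list(islice(iterators[idx], batch_size))
        if len(batch) < batch_size:
            # The source ran dry while filling this batch; stop sampling it.
            active.remove(idx)
        if batch:
            yield idx, batch


if __name__ == "__main__":
    wiki = [{"source": "wiki", "text": f"w{i}"} for i in range(5)]
    code = [{"source": "code", "text": f"c{i}"} for i in range(4)]
    for idx, batch in iter_single_source_batches([wiki, code], batch_size=2):
        print(idx, [example["text"] for example in batch])
```

The source is chosen once per batch rather than once per example, which is the only difference from example-level interleaving in this sketch.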

Your contribution

I can contribute a PR. But I wonder what the best way is to test its correctness and robustness.
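
One possible starting point for the tests, offered as a sketch rather than a proposal for the final test suite: tag every example with its source and check two properties over the produced batches, namely that no batch mixes sources and that no example is lost or duplicated. The helper name `summarize_batches` and the `source` column are assumptions made for this illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_batches(
    batches: Iterable[List[dict]], source_key: str = "source"
) -> Tuple[bool, Counter]:
    """Return (all_batches_homogeneous, number_of_examples_seen_per_source).

    Hypothetical test helper; `source_key` names a column assumed to hold
    the source tag of each example.
    """
    homogeneous = True
    seen: Counter = Counter()
    for batch in batches:
        sources_in_batch = {example[source_key] for example in batch}
        if len(sources_in_batch) > 1:
            # At least one batch mixes examples from different sources.
            homogeneous = False
        seen.update(example[source_key] for example in batch)
    return homogeneous, seen


if __name__ == "__main__":
    good = [
        [{"source": "wiki"}, {"source": "wiki"}],
        [{"source": "code"}, {"source": "code"}],
        [{"source": "wiki"}],
    ]
    ok, counts = summarize_batches(good)
    assert ok and counts == Counter({"wiki": 3, "code": 2})

    mixed = [[{"source": "wiki"}, {"source": "code"}]]
    ok, _ = summarize_batches(mixed)
    assert not ok
```

Running the same checks with a fixed seed across several batch sizes, including ones that do not divide the source lengths evenly, would probably cover most of the robustness cases.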

memray · Aug 23 '24 07:08