Add repeat() for iterable datasets

Open alex-hh opened this issue 1 year ago • 2 comments

Feature request

It would be useful to have a straightforward way to repeat iterable datasets indefinitely, giving the user complete control over when iteration starts and ends.

An IterableDataset.repeat(n) method could do this automatically.

Motivation

This feature was discussed in https://github.com/huggingface/datasets/issues/7147. It would remove the need for the current hack of using interleave_datasets with probability 0 as a simple way to achieve this functionality.

An additional benefit might be simpler use of iterable datasets in a distributed setting. If the user can assume that datasets repeat indefinitely, then issues caused by different numbers of samples appearing on different devices (e.g. https://github.com/huggingface/datasets/issues/6437, https://github.com/huggingface/datasets/issues/6594, https://github.com/huggingface/datasets/issues/6623, https://github.com/huggingface/datasets/issues/6719) could potentially be resolved by simply doing:

ids.repeat(None).take(n_samples_per_epoch)
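
To make the intended semantics concrete, here is a minimal pure-Python sketch. It is not the library's implementation: `repeat` and `take` below are hypothetical stand-alone helpers that only mirror the proposed method names, and the two shards are made-up data standing in for what two devices might see.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def repeat(make_iter: Callable[[], Iterable[T]], n: Optional[int]) -> Iterator[T]:
    """Yield the items of a re-creatable iterable n times, or forever if n is None.

    Hypothetical helper: note that an empty source with n=None never yields.
    """
    count = 0
    while n is None or count < n:
        yield from make_iter()  # restart the source from the beginning
        count += 1


def take(items: Iterable[T], k: int) -> Iterator[T]:
    """Yield at most the first k items."""
    return islice(items, k)


# Two "devices" holding shards of unequal size (assumed example data)
shard_a = lambda: iter([0, 1, 2])
shard_b = lambda: iter([10, 11])

n_samples_per_epoch = 4
epoch_a = list(take(repeat(shard_a, None), n_samples_per_epoch))
epoch_b = list(take(repeat(shard_b, None), n_samples_per_epoch))

print(epoch_a)
print(epoch_b)
```

Both devices produce exactly n_samples_per_epoch items even though their shards differ in size, which is the property that would avoid the mismatch described in the linked issues.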

Your contribution

I'm not familiar enough with the codebase to assess how straightforward this would be to implement.

If it might be very straightforward, I could possibly have a go.

Oct 02 '24 17:10 alex-hh

Perhaps concatenate_datasets can already be used to achieve almost the same effect?

Oct 03 '24 09:10 alex-hh

concatenate_datasets does the job when there is a finite number of repetitions, but for .repeat() forever we need new logic in iterable_dataset.py.
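
The distinction can be illustrated with plain iterators. This is only a sketch of the idea, not the code in iterable_dataset.py: `make_stream` and `repeat_forever` are hypothetical names, and `itertools.chain` stands in for concatenation.

```python
from itertools import chain, islice


def make_stream():
    # Stand-in for re-opening a streaming dataset from the start
    return iter(["a", "b", "c"])


# Finite case: n repetitions are the same as concatenating n fresh copies
n = 2
finite = list(chain.from_iterable(make_stream() for _ in range(n)))


# Infinite case: there is no finite list of copies to concatenate,
# so the source has to be restarted lazily each time it is exhausted
def repeat_forever(factory):
    while True:
        yield from factory()


first_seven = list(islice(repeat_forever(make_stream), 7))

print(finite)
print(first_seven)
```

Restarting the source rather than using itertools.cycle is a deliberate choice in this sketch: cycle keeps a copy of every item it has seen, which is a poor fit for large streamed data.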

Oct 03 '24 12:10 lhoestq

Done in https://github.com/huggingface/datasets/pull/7198

Mar 18 '25 10:03 lhoestq
