
Add repeat() for iterable datasets

Open alex-hh opened this issue 1 year ago • 2 comments

Feature request

It would be useful to be able to repeat iterable datasets indefinitely in a straightforward way, giving the user complete control over when iteration starts and ends.

An IterableDataset.repeat(n) function could do this automatically.

Motivation

This feature was discussed in https://github.com/huggingface/datasets/issues/7147, and it would remove the need for the hack of calling interleave_datasets with a probability of 0 as a simple way to achieve this functionality.

An additional benefit might be simpler use of iterable datasets in a distributed setting. If the user can assume that datasets repeat indefinitely, then issues caused by different numbers of samples appearing on different devices (e.g. https://github.com/huggingface/datasets/issues/6437, https://github.com/huggingface/datasets/issues/6594, https://github.com/huggingface/datasets/issues/6623, https://github.com/huggingface/datasets/issues/6719) could potentially be resolved simply by doing:

ids.repeat(None).take(n_samples_per_epoch)
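
To make the intended semantics concrete, here is a minimal pure-Python sketch of the repeat-then-take pattern. It is not the datasets implementation: the helpers `repeat` and `take` and the two shard functions are hypothetical stand-ins that only mimic the proposed behaviour, with `None` assumed to mean "repeat forever".

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def repeat(make_iterable: Callable[[], Iterable[T]], num_times: Optional[int]) -> Iterator[T]:
    """Yield all items of a fresh iterable `num_times` times (None = forever).

    Hypothetical helper, assumed semantics only; not the library's code.
    """
    completed = 0
    while num_times is None or completed < num_times:
        yielded_any = False
        for item in make_iterable():
            yielded_any = True
            yield item
        if not yielded_any:
            return  # an empty source would otherwise loop forever
        completed += 1


def take(iterable: Iterable[T], n: int) -> Iterator[T]:
    """Yield at most the first n items (hypothetical stand-in for .take(n))."""
    return islice(iterable, n)


# Two hypothetical shards of unequal size, as might land on two devices.
def shard_a() -> Iterator[int]:
    return iter(range(3))


def shard_b() -> Iterator[int]:
    return iter(range(10, 15))


n_samples_per_epoch = 4
epoch_a = list(take(repeat(shard_a, None), n_samples_per_epoch))
epoch_b = list(take(repeat(shard_b, None), n_samples_per_epoch))
print(epoch_a, epoch_b)
```

With this pattern both devices see exactly `n_samples_per_epoch` items per epoch regardless of shard size, because the smaller shard simply wraps around.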

Your contribution

I'm not familiar enough with the codebase to assess how straightforward this would be to implement.

If it might be very straightforward, I could possibly have a go.

alex-hh, Oct 02 '24 17:10

Perhaps concatenate_datasets can already be used to achieve almost the same effect?

alex-hh, Oct 03 '24 09:10

concatenate_datasets does the job when there is a finite number of repetitions, but to .repeat() forever we need new logic in iterable_dataset.py.

lhoestq, Oct 03 '24 12:10

Done in https://github.com/huggingface/datasets/pull/7198

lhoestq, Mar 18 '25 10:03