Support multi-worker with streaming dataset (IterableDataset).

Open cccntu opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe. The current .map does not support multiprocessing, so the CPU can become the bottleneck if the pre-processing is complex (e.g. T5 span masking).

Describe the solution you'd like Ideally, .map should support multiple workers, like tfds, with AUTOTUNE.

Describe alternatives you've considered A simpler solution is to shard the dataset and process the shards in parallel with a PyTorch DataLoader. The shards do not need to be of equal size.

  • https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset
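The sharding alternative could look roughly like the sketch below. It assumes the data is already split into a list of shards; `ShardedStream`, `expensive_preprocess` and `iterate` are hypothetical names for illustration, not part of datasets. Each DataLoader worker picks its own shards via `get_worker_info()` and runs the pre-processing itself, so the CPU-heavy work is spread across the worker processes.

```python
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


def expensive_preprocess(example):
    # Hypothetical stand-in for CPU-heavy work such as T5 span masking.
    return example * 2


class ShardedStream(IterableDataset):
    """Each DataLoader worker reads its own shards and preprocesses them."""

    def __init__(self, shards):
        # `shards` is a list of iterables; they may differ in length.
        self.shards = shards

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this process owns every shard.
            my_shards = self.shards
        else:
            # Worker i takes shards i, i + num_workers, i + 2 * num_workers, ...
            my_shards = self.shards[info.id :: info.num_workers]
        for shard in my_shards:
            for example in shard:
                yield expensive_preprocess(example)


def iterate(shards, num_workers):
    # batch_size=None disables automatic batching, so examples come out one by one.
    return DataLoader(ShardedStream(shards), batch_size=None, num_workers=num_workers)
```

With `num_workers > 0`, the examples from different workers are interleaved, so the output order is not the shard order. Uneven shards only mean that some workers finish earlier than others.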

Additional context

cccntu avatar Jul 14 '21 08:07 cccntu

Hi! This is a great idea :) I think we could have something similar to what we have in datasets.Dataset.map, i.e. a num_proc parameter that sets how many processes to spawn to parallelize the data processing.

Regarding AUTOTUNE, this could be a nice feature as well; we could look at adding it as a second step.
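To make the num_proc idea concrete, here is a minimal sketch using only the standard library. It is not how datasets implements anything; `parallel_map` and `count_words` are hypothetical names, and the `chunksize` default is an arbitrary assumption. It applies a function to a stream of examples with a pool of processes and yields results as they become available, in input order.

```python
from multiprocessing import Pool


def count_words(example):
    # Hypothetical CPU-bound preprocessing step.
    return {"text": example["text"], "n_words": len(example["text"].split())}


def parallel_map(function, examples, num_proc=None, chunksize=16):
    """Apply `function` to an iterable of examples, optionally in parallel."""
    if num_proc is None or num_proc <= 1:
        # No parallelism requested: stay in the current process.
        yield from map(function, examples)
        return
    with Pool(processes=num_proc) as pool:
        # imap yields results in input order; the pool may read ahead
        # of the consumer in the input stream.
        yield from pool.imap(function, examples, chunksize=chunksize)
```

With `num_proc > 1`, `function` has to be picklable (for example, defined at module level), and on platforms that spawn new interpreters the calling code should sit under an `if __name__ == "__main__":` guard.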

lhoestq avatar Jul 15 '21 09:07 lhoestq

Any update on this feature request?

memray avatar May 03 '24 07:05 memray

Not yet, I'm happy to provide some guidance if someone wants to give it a shot though.

The code that applies the map function is in iterable_dataset.py, in MappedExamplesIterable.__iter__.

lhoestq avatar May 03 '24 10:05 lhoestq