Support multi-worker with streaming dataset (IterableDataset).
Is your feature request related to a problem? Please describe.
The current .map does not support multiprocessing, so the CPU can become a bottleneck when the pre-processing is complex (e.g. T5 span masking).
Describe the solution you'd like
Ideally, .map should support multiple workers like tfds, with AUTOTUNE.
Describe alternatives you've considered
A simpler solution is to shard the dataset and process the shards in parallel with the PyTorch DataLoader. The shards do not need to be of equal size.
- https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset
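To make the sharding alternative concrete, here is a minimal sketch of the per-worker shard assignment. It is plain Python and only an illustration: shards_for_worker, iter_worker_examples, and read_shard are hypothetical names, not part of datasets or PyTorch. Inside an IterableDataset.__iter__, worker_id and num_workers would come from torch.utils.data.get_worker_info().

```python
def shards_for_worker(shards, worker_id, num_workers):
    """Return the shards one worker should read (round-robin split).

    Hypothetical helper. Shards may have different sizes; each shard is
    read by exactly one worker, so no example is duplicated.
    """
    return shards[worker_id::num_workers]


def iter_worker_examples(shards, worker_id, num_workers, read_shard, function):
    """Yield processed examples from the shards assigned to one worker.

    read_shard is an assumed callable that takes a shard and returns an
    iterable of examples; function is the per-example pre-processing.
    """
    for shard in shards_for_worker(shards, worker_id, num_workers):
        for example in read_shard(shard):
            yield function(example)
```

With the DataLoader, each worker process would run iter_worker_examples over its own subset, so the pre-processing is spread across num_workers processes. If there are fewer shards than workers, some workers simply receive nothing.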
Additional context
Hi ! This is a great idea :)
I think we could have something similar to what we have in datasets.Dataset.map, i.e. a num_proc parameter that tells how many processes to spawn to parallelize the data processing.
Regarding AUTOTUNE, this could be a nice feature as well. We could see how to add it in a second step.
Any update on this feature request?
Not yet, I'm happy to provide some guidance if someone wants to give it a shot though.
The code that applies the map function is in iterable_dataset.py, in MappedExamplesIterable.__iter__
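For anyone picking this up, here is a rough sketch of what a parallel map over a stream could look like. This is not the code in iterable_dataset.py; parallel_map_examples and max_in_flight are hypothetical names, and the function only assumes it is handed a concurrent.futures executor. It keeps a bounded number of tasks pending so the stream is never fully consumed up front, and it yields results in input order.

```python
from collections import deque
from itertools import islice


def parallel_map_examples(examples, function, executor, max_in_flight=8):
    """Lazily apply `function` to each example using `executor`.

    Hypothetical sketch: at most `max_in_flight` tasks are pending at any
    time, and results are yielded in the same order as the inputs.
    """
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    it = iter(examples)
    # Fill the window with the first few examples.
    pending = deque(executor.submit(function, ex) for ex in islice(it, max_in_flight))
    while pending:
        # Wait for the oldest task so that output order matches input order.
        result = pending.popleft().result()
        # Refill the window with one more example, if the stream has one.
        for ex in islice(it, 1):
            pending.append(executor.submit(function, ex))
        yield result
```

Passing a concurrent.futures.ProcessPoolExecutor would give multi-process behaviour, provided the function and the examples can be pickled; a ThreadPoolExecutor works with the same code but only helps when the function releases the GIL.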