datasets Support of num_workers (multiprocessing) in map for IterableDataset

Open • getao opened this issue 1 year ago • 1 comment

Feature request

Currently, IterableDataset doesn't support setting num_workers in .map(), which makes processing slow. Could we add support for it? Since .map() can run in batches (e.g., batch_size defaults to 1000 in datasets), this seems doable for IterableDataset, just as it is for the regular Dataset (whose .map() takes num_proc).

Motivation

Improving data processing efficiency

Your contribution

Testing

Oct 02 '24 18:10 getao

I was curious about the same thing. Since map is applied on the fly, I assumed that setting num_workers>1 in DataLoader would effectively run the map in parallel. Have you tried that?

Oct 03 '24 09:10 alex-hh
