Support of num_workers (multiprocessing) in map for IterableDataset

Open getao opened this issue 1 year ago • 1 comments

Feature request

Currently, IterableDataset doesn't support setting num_workers in .map(), which makes processing slow. Could we add support for it? Since .map() can run in batched mode (e.g., batch_size defaults to 1000 in datasets), this seems as doable for IterableDataset as it is for the regular Dataset.
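To illustrate why a batched map over a stream can be parallelized, here is a minimal standard-library sketch. It is not how datasets implements .map(), and the names iter_batches, parallel_batched_map, and executor_cls are hypothetical, introduced only for this example. It assumes fn takes a list of items and returns a list of results, and that fn is picklable when a process pool is used.

```python
from collections import deque
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from itertools import islice


def iter_batches(items, batch_size):
    """Lazily group any iterable into lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def parallel_batched_map(fn, items, batch_size=1000, num_workers=4,
                         executor_cls=ProcessPoolExecutor):
    """Apply fn to batches of a stream in parallel, yielding results in order.

    Hypothetical helper, not part of datasets. Pass ThreadPoolExecutor as
    executor_cls if fn is I/O-bound or not picklable.
    """
    batches = iter_batches(items, batch_size)
    with executor_cls(max_workers=num_workers) as pool:
        pending = deque()
        # Keep at most num_workers batches in flight so the stream is
        # consumed lazily rather than read to the end up front.
        for batch in islice(batches, num_workers):
            pending.append(pool.submit(fn, batch))
        while pending:
            # Wait for the oldest batch first to preserve input order.
            result = pending.popleft().result()
            next_batch = next(batches, None)
            if next_batch is not None:
                pending.append(pool.submit(fn, next_batch))
            yield from result
```

With the default process pool, fn should be defined at module level, and on platforms that spawn worker processes the call should sit under an `if __name__ == "__main__":` guard.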

Motivation

Improving data processing efficiency

Your contribution

Testing

getao, Oct 02 '24 18:10

I was curious about the same thing. Since map is applied on the fly, I assumed that setting num_workers>1 in the DataLoader would effectively run the map in parallel. Have you tried that?

alex-hh, Oct 03 '24 09:10