
ParallelMapper missing a `worker_init_fn`

Open alanhdu opened this issue 7 months ago • 1 comment

🚀 The feature

Add `worker_init_fn` support for `ParallelMapper` (and maybe `persistent_workers`).

Motivation, pitch

Right now, there is no way to specify a custom `worker_init_fn` for `ParallelMapper` to customize worker startup. This means we cannot use `ParallelMapper` in process mode, since we often need to configure credentials, loggers, random seeds, etc. in each worker.
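For illustration, a rough sketch of how the requested parameter might be used. The `worker_init_fn` keyword on `ParallelMapper` does not exist today; its name and signature here are purely hypothetical, and `source` / `transform` are placeholder names for an upstream node and a per-item map function.

```python
import logging
import random

import numpy as np
from torchdata.nodes import ParallelMapper


def my_worker_init(worker_id: int) -> None:
    # Hypothetical hook run once in each worker process before any items
    # are mapped: configure logging, credentials, per-worker seeds, etc.
    logging.basicConfig(level=logging.INFO)
    random.seed(1234 + worker_id)
    np.random.seed(1234 + worker_id)


# Hypothetical API: `worker_init_fn` is the parameter being requested here.
node = ParallelMapper(
    source,                       # upstream node producing raw samples (placeholder)
    map_fn=transform,             # per-item transform (placeholder)
    num_workers=4,
    method="process",
    worker_init_fn=my_worker_init,
)
```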

We could also consider adding a `persistent_workers` flag to avoid spinning up new worker processes on each epoch (or, if this is already the behavior, make it clear in the https://docs.pytorch.org/data/main/migrate_to_nodes_from_utils.html#map-style-datasets section that this is the case), which would avoid wasting time re-initializing workers on every epoch. This would mirror the existing DataLoader option shown below.
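For reference, this is the existing `torch.utils.data.DataLoader` behavior the request mirrors; `dataset` is a placeholder for any map-style dataset.

```python
from torch.utils.data import DataLoader

# With persistent_workers=True, worker processes are kept alive between
# epochs instead of being shut down and re-spawned each time the loader
# is iterated again.
loader = DataLoader(dataset, num_workers=4, persistent_workers=True)
```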

Alternatives

No response

Additional context

I think it'd also be helpful for the docs to talk a bit about the relationship with torch.utils.data.get_worker_info -- that function is pretty commonly used for per-worker random seeding, but it's not clear whether (or how) it works with nodes.
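For context, this is the common DataLoader pattern being referred to: `torch.utils.data.get_worker_info()` returns per-worker metadata inside a worker process and is frequently used together with `worker_init_fn` for seeding. Whether the same pattern carries over to nodes is exactly the open question above.

```python
import random

import numpy as np
from torch.utils.data import get_worker_info


def seed_worker(worker_id: int) -> None:
    # Inside a DataLoader worker process, get_worker_info() returns a
    # WorkerInfo object exposing id, num_workers, and a per-worker seed.
    info = get_worker_info()
    seed = info.seed % 2**32  # numpy requires a seed in [0, 2**32)
    random.seed(seed)
    np.random.seed(seed)


# Typically passed as DataLoader(..., worker_init_fn=seed_worker).
```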

alanhdu avatar Jun 04 '25 20:06 alanhdu

Thanks for the issue! Both are valid points -- we can look at adding `worker_init_fn`; `persistent_workers` might need more thought.

divyanshk avatar Jun 04 '25 20:06 divyanshk