ParallelMapper missing a `worker_init_fn`
🚀 The feature
Add worker_init_fn support for ParallelMapper (and maybe persistent_workers).
Motivation, pitch
Right now, there is no way to specify a custom worker_init_fn for the parallel mapping to customize worker startup. This means we effectively cannot use ParallelMapper in process mode, since we often need to configure credentials, loggers, random seeds, etc. in each worker.
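For context, the workaround I can think of is folding the initialization into map_fn itself via a lazily-initializing callable. This is only a sketch (MapFnWithInit and my_init are made-up names), and it assumes ParallelMapper just needs a picklable callable for map_fn, so the lazy flag resets once per worker process:

```python
import os
import random

class MapFnWithInit:
    """Wrap a map_fn so that per-worker setup (seeds, loggers,
    credentials, ...) runs lazily on the first call in each worker."""

    def __init__(self, map_fn, init_fn):
        self.map_fn = map_fn
        self.init_fn = init_fn
        self._initialized = False  # copied/unpickled as False in each worker

    def __call__(self, item):
        if not self._initialized:
            self.init_fn()
            self._initialized = True
        return self.map_fn(item)

def my_init():
    # Placeholder per-worker setup, e.g. seeding from the worker's PID.
    random.seed(os.getpid())

# node = ParallelMapper(source, MapFnWithInit(my_map, my_init),
#                       num_workers=4, method="process")
```

This works but is easy to get wrong (and couples setup to the mapping logic), which is why a first-class worker_init_fn would be nicer.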
We could also consider adding a persistent_workers flag to avoid spinning up new processes on each epoch, which would save the cost of re-initializing workers. (If this is already the behavior, it would help to state it explicitly in the https://docs.pytorch.org/data/main/migrate_to_nodes_from_utils.html#map-style-datasets section.)
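Roughly what I have in mind, purely as a hypothetical sketch of the constructor: neither worker_init_fn nor persistent_workers exists on ParallelMapper today, and the other argument names are the current ones as I understand them (source, my_map, my_worker_init are placeholders):

```python
# Proposed, not an existing API.
node = ParallelMapper(
    source,
    map_fn=my_map,
    num_workers=4,
    method="process",                # run map_fn in worker processes
    worker_init_fn=my_worker_init,   # proposed: called once per worker at startup
    persistent_workers=True,         # proposed: keep workers alive across epochs
)
```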
Alternatives
No response
Additional context
I think it'd also be helpful for the docs to cover the relationship with torch.utils.data.get_worker_info -- that function is commonly used for per-worker random seeding, but it's unclear to me whether (or how) it works with nodes.
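For reference, the pattern I mean is the standard DataLoader one, where worker_init_fn reads torch.utils.data.get_worker_info() to derive a per-worker seed; how (or whether) this maps onto torchdata.nodes is exactly what the docs could clarify. A minimal example of that DataLoader pattern:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, get_worker_info

def seed_worker(worker_id: int) -> None:
    # get_worker_info() returns a WorkerInfo object inside a DataLoader
    # worker process (and None in the main process).
    info = get_worker_info()
    base_seed = info.seed if info is not None else torch.initial_seed()
    seed = base_seed % 2**32
    random.seed(seed)
    np.random.seed(seed)

# loader = DataLoader(my_dataset, num_workers=4, worker_init_fn=seed_worker)
```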
Thanks for the issue! Both are valid points: we can look at adding worker_init_fn; persistent_workers might need more thought.