
Add support for sharding filter in distributed settings

jkulhanek opened this issue 3 years ago · 4 comments

🚀 The feature

Implement a distributed_sharding_filter that would behave similarly to sharding_filter (https://github.com/pytorch/pytorch/blob/3f140c5b32fa8685cc7a10bdb94f3f8b127e3a92/torch/utils/data/datapipes/iter/grouping.py), but would filter according to the global rank and world size if torch.distributed is initialized. If torch.distributed is not initialized, the filter would do nothing.
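
A minimal sketch of the requested behavior might look like the following (the class name and the no-op fallback are illustrative, not an existing API):

```python
import torch.distributed as dist
from torch.utils.data import IterDataPipe


class DistributedShardingFilter(IterDataPipe):
    """Hypothetical DataPipe: shards by global rank when torch.distributed
    is initialized, otherwise passes every element through unchanged."""

    def __init__(self, source_datapipe: IterDataPipe):
        self.source_datapipe = source_datapipe

    def __iter__(self):
        if dist.is_available() and dist.is_initialized():
            rank = dist.get_rank()
            world_size = dist.get_world_size()
        else:
            # No process group: behave as a pass-through.
            rank, world_size = 0, 1
        for i, item in enumerate(self.source_datapipe):
            # Keep every world_size-th element, offset by this rank.
            if i % world_size == rank:
                yield item
```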

Motivation, pitch

I am running distributed training and had to write this filter myself. I think it is needed in most distributed training scenarios, so it would make a lot of sense to add it to the library.

Alternatives

An alternative implementation would extend the sharding_filter already present in the PyTorch core library. However, that might break backward compatibility.

Additional context

No response

jkulhanek · Apr 12 '22

Thanks for raising this. We understand the need and are working on it. We are currently working on DataLoader2 to handle dynamic sharding via sharding_filter for both the multiprocessing and distributed scenarios.

ejguan · Apr 12 '22

Should I close this issue, or should I link it to a feature request in the PyTorch repository?

jkulhanek · Apr 12 '22

No need to close it. We will keep you updated when this feature lands.

ejguan · Apr 12 '22

Just want to give an update here. If you have sharding_filter in your pipeline, the DataPipe graph should now be dynamically sharded by DataLoader. You can either use the nightly releases of PyTorch Core and TorchData or wait a few weeks for the upcoming official release. Please take a look at the tutorial.
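
For context, a minimal pipeline of this shape (illustrative values, using the core DataPipes API) looks like:

```python
from torch.utils.data import DataLoader
from torch.utils.data.datapipes.iter import IterableWrapper


def times_two(x):
    return x * 2


# sharding_filter marks where the graph should be split across workers;
# DataLoader shards the pipeline dynamically at that point.
dp = IterableWrapper(range(100)).shuffle().sharding_filter().map(times_two)

# Each of the two workers then sees a disjoint half of the elements.
for batch in DataLoader(dp, num_workers=2):
    ...
```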

And we will keep working on DataLoader2 for better support of parallel execution and other features like snapshotting. Please stay tuned.

ejguan · Jun 10 '22

This is done in both DistributedReadingService and PrototypeMultiProcessingReadingService. Closing now.
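
A rough usage sketch under that setup (assuming the TorchData DataLoader2 API of that era and a process group already initialized, e.g. by torchrun) would be:

```python
from torchdata.dataloader2 import DataLoader2, DistributedReadingService
from torchdata.datapipes.iter import IterableWrapper

dp = IterableWrapper(range(100)).shuffle().sharding_filter()

# DistributedReadingService shards the graph at sharding_filter
# according to the global rank and world size.
dl = DataLoader2(dp, reading_service=DistributedReadingService())
for item in dl:
    ...
dl.shutdown()
```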

ejguan · Oct 20 '22