
`in_batch_shuffle` currently doesn't use shared RNG

Open NivekT opened this issue 2 years ago • 0 comments

Currently, `in_batch_shuffle` doesn't use the RNG shared across processes. This can be problematic if it is used prior to sharding: because each process shuffles with its own RNG, some samples may never be seen during training while others are seen by multiple processes.

https://github.com/pytorch/data/blob/100d086413873a5c224842da4c2cd55cb634317f/torchdata/datapipes/iter/transform/bucketbatcher.py#L19-L20
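The failure mode can be illustrated without torchdata at all. The sketch below is a plain-Python simulation (not torchdata's implementation): two "workers" each shuffle the full dataset with an independent RNG and then take a round-robin shard. When the shuffles disagree, the shards are no longer a clean partition; with a shared seed, they are.

```python
import random

samples = list(range(8))
world_size = 2

def shard_after_shuffle(rank, seed):
    """Shuffle the full dataset with a per-process RNG, then shard round-robin.
    If `seed` differs across ranks, the shuffles disagree (the bug described above)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return shuffled[rank::world_size]

# Independent (unshared) RNGs: the two permutations differ, so the union of
# shards may contain duplicates and miss some samples entirely.
seen = shard_after_shuffle(0, seed=0) + shard_after_shuffle(1, seed=1)
print("independent RNGs -> missing:", sorted(set(samples) - set(seen)))

# Shared RNG (same seed on every rank): the shuffles agree, so sharding
# partitions the dataset exactly.
seen_shared = shard_after_shuffle(0, seed=42) + shard_after_shuffle(1, seed=42)
assert sorted(seen_shared) == samples
```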

Note that `in_batch_shuffle` is also used by `bucketbatch`.

https://github.com/pytorch/data/blob/100d086413873a5c224842da4c2cd55cb634317f/torchdata/datapipes/iter/transform/bucketbatcher.py#L50-L51

Moreover, we may want to provide the ability to `set_seed` for those two DataPipes to enable reproducibility.
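What a `set_seed` hook might buy us can be sketched as follows. This is a hypothetical stand-in class, not torchdata's API: it only shows that seeding the shuffler's RNG makes two independently constructed instances produce identical orderings.

```python
import random

class InBatchShuffler:
    """Hypothetical sketch of a shuffler exposing set_seed() for reproducibility.
    This is NOT torchdata's implementation; it only illustrates the proposal."""

    def __init__(self):
        self._rng = random.Random()

    def set_seed(self, seed):
        # Re-seed the internal RNG; return self to allow fluent chaining,
        # mirroring the DataPipe method-chaining style.
        self._rng.seed(seed)
        return self

    def __call__(self, batch):
        batch = list(batch)
        self._rng.shuffle(batch)
        return batch

# Same seed -> identical shuffle order across separate instances.
s1 = InBatchShuffler().set_seed(123)
s2 = InBatchShuffler().set_seed(123)
assert s1([1, 2, 3, 4, 5]) == s2([1, 2, 3, 4, 5])
```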

NivekT avatar Jun 15 '22 20:06 NivekT