`in_batch_shuffle` currently doesn't use shared RNG
Currently, `in_batch_shuffle` doesn't use the RNG shared across processes. This can be problematic if it is used prior to sharding: certain samples may never be used in training while others may be used by multiple processes.
https://github.com/pytorch/data/blob/100d086413873a5c224842da4c2cd55cb634317f/torchdata/datapipes/iter/transform/bucketbatcher.py#L19-L20
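A minimal sketch of the failure mode (the `worker_view` helper and the two-process simulation are illustrative assumptions, not torchdata API): each "process" seeds the global `random` module independently, `in_batch_shuffle` draws from that process-local state, and round-robin sharding after the shuffle then drops some samples and duplicates others.

```python
import random
from torchdata.datapipes.iter import IterableWrapper

def worker_view(rank, world_size, seed):
    # Each process seeds the global `random` module on its own; this is the
    # process-local RNG that `in_batch_shuffle` currently draws from.
    random.seed(seed)
    dp = IterableWrapper(list(range(8))).batch(4)
    dp = dp.in_batch_shuffle()  # within-batch order now differs per process
    dp = dp.unbatch()
    dp = dp.sharding_filter()   # round-robin: keep every world_size-th element
    dp.apply_sharding(world_size, rank)
    return list(dp)

shard0 = worker_view(rank=0, world_size=2, seed=0)
shard1 = worker_view(rank=1, world_size=2, seed=1)
# Because each rank shuffled differently before sharding, the union of the
# shards is generally no longer a permutation of range(8): some samples
# appear twice, others not at all.
print(sorted(shard0 + shard1))
```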
Note that `in_batch_shuffle` is also used by `bucketbatch`.
https://github.com/pytorch/data/blob/100d086413873a5c224842da4c2cd55cb634317f/torchdata/datapipes/iter/transform/bucketbatcher.py#L50-L51
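Since `bucketbatch` shuffles through `in_batch_shuffle` internally (as of the linked commit), it inherits the same process-local behavior. A small sketch (the `batches_for` helper and the two seeds standing in for two processes are assumptions for illustration):

```python
import random
from torchdata.datapipes.iter import IterableWrapper

def batches_for(seed):
    random.seed(seed)  # stands in for a process-local seed
    return list(IterableWrapper(list(range(12))).bucketbatch(batch_size=3))

# Two "processes" with different seeds produce differently composed batches,
# because the internal in_batch_shuffle draws from the process-local RNG.
print(batches_for(0))
print(batches_for(1))
```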
Moreover, we may want to provide the ability to `set_seed` for those two DataPipes to enable reproducibility.
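As a strawman, a `set_seed` hook could look roughly like the sketch below. Everything here is hypothetical, not existing torchdata API: the class `SeededInBatchShuffler`, the functional name `in_batch_shuffle_seeded`, and the `set_seed` method are all assumptions. The idea is to own a `random.Random` instance instead of reading module-level state, and to let the shared-seeding machinery call `set_seed` with the same value on every rank.

```python
import random
from torch.utils.data import IterDataPipe, functional_datapipe

@functional_datapipe("in_batch_shuffle_seeded")  # hypothetical name
class SeededInBatchShuffler(IterDataPipe):
    def __init__(self, datapipe, seed=None):
        self.datapipe = datapipe
        self._rng = random.Random(seed)  # per-instance RNG, not module-level

    def set_seed(self, seed):
        # Intended to be called with the same value on every process so that
        # all ranks shuffle identically before sharding.
        self._rng.seed(seed)
        return self

    def __iter__(self):
        for batch in self.datapipe:
            self._rng.shuffle(batch)  # draws only from the shared-seeded RNG
            yield batch
```

With a hook like this, a pipeline such as `batch(...).in_batch_shuffle_seeded().unbatch().sharding_filter()` would produce consistent shards across processes, and `bucketbatch` could expose the same hook by delegating to its internal shuffler.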