Unblock ProtoMPRS to control determinism of DataPipe in single/multi-processing and dist/non-dist env
This PR temporarily extends PrototypingMultiProcessingReadingService to fully control the determinism of the pipeline in the combinations of:
- Single/Multi-processing
- Distributed/Non-distributed
When we have `SequentialReadingService` ready to combine `DistributedReadingService` and `PrototypingMultiProcessingReadingService`, some of this code should be removed. And, for the in-process reading service, we still need a method to isolate global RNGs so that the data pipeline does not interfere with the model's randomness.
For the multiprocessing case, it sets the same random seed for Shuffler and sets different deterministic seeds for the global RNGs (Python's random, torch and numpy) within each subprocess.
For the distributed case, it shares the same random seed for Shuffler across all distributed subprocesses to guarantee the same shuffle order before sharding.
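As a rough illustration of the seeding scheme described above (not the code in this PR), the sketch below derives one shared shuffle seed plus a distinct, deterministic seed per worker from a single base seed. `derive_seeds` and `worker_setup` are hypothetical names, and only the stdlib `random` module is seeded here; a real version would presumably seed torch and numpy in the same place.

```python
import random


def derive_seeds(base_seed, num_workers):
    """Hypothetical helper: one shared shuffle seed, one seed per worker."""
    rng = random.Random(base_seed)
    # Shared by every worker (and every rank) so all of them shuffle identically.
    shuffle_seed = rng.getrandbits(32)
    # Offsetting by the worker id keeps the per-worker seeds distinct.
    worker_base = rng.getrandbits(32)
    worker_seeds = [worker_base + i for i in range(num_workers)]
    return shuffle_seed, worker_seeds


def worker_setup(worker_seed):
    """Hypothetical per-subprocess init: seed the global RNG of this worker."""
    random.seed(worker_seed)
    # Assumption: torch and numpy global RNGs would also be seeded here.
    return random.random()  # first draw, useful for checking reproducibility
```

Calling `derive_seeds` with the same base seed on every rank gives the same shuffle seed everywhere, while each worker still gets its own seed for its global RNGs.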
Tests: All tests are executed in the combinations of the above environments
- [x] Validate the same seed will generate the same order of data
- [x] Validate different seeds will generate different order of data
- [x] Validate the data after shuffle and sharding in each worker are mutually exclusive and collectively exhaustive with/without manual seed
There is one missing test that I will add tomorrow:
- [x] Validate subprocess-local RNGs like `random`, `torch` and `numpy` are properly set with different seeds.
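For reference, the "mutually exclusive and collectively exhaustive" check from the test list can be sketched roughly as follows. This is a simplified stand-in using plain lists and round-robin sharding, not the actual test code; `shard_after_shuffle` and `is_mece` are hypothetical names.

```python
import random


def shard_after_shuffle(n, num_workers, seed=None):
    """Shuffle range(n) with one shared seed, then shard round-robin."""
    if seed is None:
        # No manual seed: draw one seed once and share it with all workers.
        seed = random.Random().getrandbits(32)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[w::num_workers] for w in range(num_workers)]


def is_mece(shards, n):
    """True if the shards do not overlap and together cover range(n)."""
    flat = [x for shard in shards for x in shard]
    return len(flat) == len(set(flat)) and set(flat) == set(range(n))
```

Because every worker shuffles with the same seed before sharding, the shards partition the dataset whether or not a manual seed is given.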
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.