HugeCTR
[BUG] Random seed is not synchronized between nodes in multi-node training.
Describe the bug: The random seed is not synchronized between nodes in multi-node training.
Code https://github.com/NVIDIA/HugeCTR/blob/62447dc2a2201e25184f8d74f3d39f300417cb13/HugeCTR/include/mmap_offset_list.hpp#L102
https://github.com/NVIDIA/HugeCTR/blob/36612f9ecca93248379d5fac6eb405a736a23547/HugeCTR/src/resource_manager.cpp#L30-L48
Screenshots
Device 0: Tesla V100-SXM2-16GB
[27d03h34m56s][HUGECTR][INFO]: Initial seed is 926002362
[27d03h34m56s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[27d03h34m56s][HUGECTR][INFO]: Vocabulary size: 187767399
[27d16h35m01s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Device 0: Tesla V100-SXM2-16GB
[27d16h35m01s][HUGECTR][INFO]: Initial seed is 3302328738
[27d16h35m01s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[27d16h35m01s][HUGECTR][INFO]: Vocabulary size: 187767399
[27d03h34m58s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Device 0: Tesla V100-SXM2-16GB
[27d03h34m58s][HUGECTR][INFO]: Initial seed is 3986389508
[27d03h34m58s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[27d03h34m58s][HUGECTR][INFO]: Vocabulary size: 187767399
[27d03h35m04s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Actually, in the current implementation (which we will modify later), this is not a bug: HugeCTR doesn't need the seed to be synchronized. Models are initialized with a random seed on the first process and then read by the other processes.
Will different random seeds cause different shuffle results in DataReader, and therefore inconsistent input data across nodes?
https://github.com/NVIDIA/HugeCTR/blob/62447dc2a2201e25184f8d74f3d39f300417cb13/HugeCTR/include/mmap_offset_list.hpp#L100-L104
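To make the concern concrete, here is a minimal sketch of seed-dependent shuffling. It assumes the offset list is shuffled with a `std::mt19937` engine and `std::shuffle`; the helper name `shuffled_offsets` is hypothetical and is not taken from the HugeCTR source.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper (not HugeCTR code): returns the offsets 0..n-1
// shuffled by a Mersenne Twister engine seeded with `seed`.
std::vector<std::size_t> shuffled_offsets(std::size_t n, unsigned seed) {
  std::vector<std::size_t> offsets(n);
  std::iota(offsets.begin(), offsets.end(), std::size_t{0});  // 0, 1, ..., n-1
  std::mt19937 gen(seed);  // same seed => same random sequence
  std::shuffle(offsets.begin(), offsets.end(), gen);
  return offsets;
}
```

Calling `shuffled_offsets(64, 926002362u)` twice yields the same order, whereas a process seeded with `3302328738u` (the second seed in the logs above) would, in practice, visit the offsets in a different order. That is the inconsistency the question is about.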
I get your point. Yes, I think it's a bug. Thank you!
@minseokl I think seed sync is no longer an issue in the latest HugeCTR, but please help to confirm.
For both MmapOffsetList and ResourceManager, we do HCTR_MPI_THROW(MPI_Bcast(&seed, 1, MPI_UNSIGNED, 0, MPI_COMM_WORLD));
Thus, it is not an issue anymore. Let me close it.
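For readers unfamiliar with `MPI_Bcast`, the effect of that call can be sketched in a single process without MPI. This is only an illustration: `broadcast_seed_from_root` and `seeds_agree` are hypothetical names, and the vector stands in for the value of `seed` held by each rank.

```cpp
#include <algorithm>
#include <vector>

// Single-process stand-in for MPI_Bcast with root 0 (not real MPI):
// per_rank_seeds[r] plays the role of `seed` on rank r.
void broadcast_seed_from_root(std::vector<unsigned>& per_rank_seeds) {
  if (per_rank_seeds.empty()) return;
  const unsigned root_seed = per_rank_seeds.front();  // rank 0's value wins
  std::fill(per_rank_seeds.begin(), per_rank_seeds.end(), root_seed);
}

// True when every rank holds the same seed.
bool seeds_agree(const std::vector<unsigned>& per_rank_seeds) {
  return std::all_of(per_rank_seeds.begin(), per_rank_seeds.end(),
                     [&](unsigned s) { return s == per_rank_seeds.front(); });
}
```

Starting from the three seeds reported in the logs (`926002362`, `3302328738`, `3986389508`), `seeds_agree` is false before the broadcast and true afterwards, with every rank holding rank 0's seed.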