
[BUG] Random seed does not synchronize between nodes in Multi-Nodes Training.

Open Kur0x opened this issue 3 years ago • 4 comments

Describe the bug

The random seed is not synchronized between nodes in multi-node training. Each node logs a different initial seed, as the output under Screenshots shows.

Code

https://github.com/NVIDIA/HugeCTR/blob/62447dc2a2201e25184f8d74f3d39f300417cb13/HugeCTR/include/mmap_offset_list.hpp#L102

https://github.com/NVIDIA/HugeCTR/blob/36612f9ecca93248379d5fac6eb405a736a23547/HugeCTR/src/resource_manager.cpp#L30-L48

Screenshots

Device 0: Tesla V100-SXM2-16GB
[27d03h34m56s][HUGECTR][INFO]: Initial seed is 926002362
[27d03h34m56s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[27d03h34m56s][HUGECTR][INFO]: Vocabulary size: 187767399
[27d16h35m01s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Device 0: Tesla V100-SXM2-16GB
[27d16h35m01s][HUGECTR][INFO]: Initial seed is 3302328738
[27d16h35m01s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[27d16h35m01s][HUGECTR][INFO]: Vocabulary size: 187767399
[27d03h34m58s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.
Device 0: Tesla V100-SXM2-16GB
[27d03h34m58s][HUGECTR][INFO]: Initial seed is 3986389508
[27d03h34m58s][HUGECTR][INFO]: cache_eval_data is not specified using default: 0
[27d03h34m58s][HUGECTR][INFO]: Vocabulary size: 187767399
[27d03h35m04s][HUGECTR][INFO]: Peer-to-peer access cannot be fully enabled.

Kur0x avatar Nov 27 '20 08:11 Kur0x

Actually, in the current implementation (which we will modify later), this is not a bug: HugeCTR doesn't need the seed to be synchronized. The model is initialized with a random seed on the first process, and the other processes read the initialized model from it.

zehuanw avatar Nov 30 '20 08:11 zehuanw

Will different random seeds produce different shuffle results in the DataReader on each node, and so make the input data inconsistent across nodes?

https://github.com/NVIDIA/HugeCTR/blob/62447dc2a2201e25184f8d74f3d39f300417cb13/HugeCTR/include/mmap_offset_list.hpp#L100-L104

Kur0x avatar Nov 30 '20 08:11 Kur0x

I get your point. Yes, I think it's a bug. Thank you!

zehuanw avatar Dec 01 '20 14:12 zehuanw

@minseokl I think seed sync is no longer an issue in the latest HugeCTR, but please help confirm.

zehuanw avatar May 02 '22 03:05 zehuanw

For both MmapOffsetList and ResourceManager, we now call HCTR_MPI_THROW(MPI_Bcast(&seed, 1, MPI_UNSIGNED, 0, MPI_COMM_WORLD)); so every rank uses the seed chosen by rank 0. Thus, it is not an issue anymore. Let me close it.

minseokl avatar Nov 23 '22 03:11 minseokl
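
For readers who land here later, the shuffle concern raised above can be illustrated with a small standalone sketch. It is not HugeCTR code: `shuffled_offsets` and `ranks_agree` are hypothetical helpers, and the ranks are simulated in a single process rather than through MPI. It assumes each rank shuffles the same list of offsets with `std::mt19937` and `std::shuffle`, which may differ from what MmapOffsetList actually does.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper (not part of HugeCTR): returns the order in which one
// rank would visit `num_samples` offsets after shuffling them with `seed`.
std::vector<std::size_t> shuffled_offsets(std::size_t num_samples, unsigned seed) {
  std::vector<std::size_t> offsets(num_samples);
  std::iota(offsets.begin(), offsets.end(), std::size_t{0});  // 0, 1, 2, ...
  std::mt19937 gen(seed);
  std::shuffle(offsets.begin(), offsets.end(), gen);
  return offsets;
}

// Hypothetical helper: `seeds[i]` is the seed used by simulated rank i.
// Returns true when every rank produces exactly the same shuffled order.
bool ranks_agree(std::size_t num_samples, const std::vector<unsigned>& seeds) {
  if (seeds.empty()) return true;
  const std::vector<std::size_t> reference =
      shuffled_offsets(num_samples, seeds.front());
  return std::all_of(seeds.begin() + 1, seeds.end(), [&](unsigned s) {
    return shuffled_offsets(num_samples, s) == reference;
  });
}
```

Passing the three seeds from the log above (926002362, 3302328738, 3986389508) to `ranks_agree` models the reported behavior, where each node draws its own seed and ends up with its own sample order. Passing the same seed three times models the state after the seed is broadcast from rank 0, where all ranks shuffle identically.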