Megatron-LM [BUG] NCCL TIMEOUT ( maybe ALLREDUCE ? )

When I use Megatron.core to train a moe model, I got the following bugs :

Output Info : [rank2]:[E ProcessGroupNCCL.cpp:754] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600381 milliseconds before timing out. 9aee1c01f3e7:6727:7168 [1] NCCL INFO [Service thread] Connection closed by localRank 2 9aee1c01f3e7:6728:7169 [2] NCCL INFO [Service thread] Connection closed by localRank 2 9aee1c01f3e7:6729:7171 [3] NCCL INFO [Service thread] Connection closed by localRank 2 [rank0]:[E ProcessGroupNCCL.cpp:754] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600528 milliseconds before timing out. 9aee1c01f3e7:6727:7168 [1] NCCL INFO [Service thread] Connection closed by localRank 0 9aee1c01f3e7:6726:7170 [0] NCCL INFO [Service thread] Connection closed by localRank 0 9aee1c01f3e7:6729:7171 [3] NCCL INFO [Service thread] Connection closed by localRank 0 [rank1]:[E ProcessGroupNCCL.cpp:754] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600916 milliseconds before timing out. 9aee1c01f3e7:6726:7170 [0] NCCL INFO [Service thread] Connection closed by localRank 1 9aee1c01f3e7:6727:7168 [1] NCCL INFO [Service thread] Connection closed by localRank 1 9aee1c01f3e7:6728:7169 [2] NCCL INFO [Service thread] Connection closed by localRank 1 9aee1c01f3e7:6727:7025 [1] NCCL INFO comm 0x56490665efb0 rank 1 nranks 4 cudaDev 1 busId 5a000 - Abort COMPLETE [rank1]:[E ProcessGroupNCCL.cpp:768] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank1]:[E ProcessGroupNCCL.cpp:774] To avoid data inconsistency, we are taking the entire process down. [rank1]:[E ProcessGroupNCCL.cpp:1282] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600916 milliseconds before timing out. Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:756 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7f475399c8f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(c10::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f2 (0x7f46f5758142 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x178 (0x7f46f575e538 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x8e (0x7f46f575eb2e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdc253 (0x7f47534b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f475f2a2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: clone + 0x44 (0x7f475f333814 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError' what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600916 milliseconds before timing out. Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:756 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7f475399c8f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(c10::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1f2 (0x7f46f5758142 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x178 (0x7f46f575e538 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x8e (0x7f46f575eb2e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #4: + 0xdc253 (0x7f47534b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #5: + 0x94ac3 (0x7f475f2a2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #6: clone + 0x44 (0x7f475f333814 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1286 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7f475399c8f9 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so) frame #1: + 0xf59d3e (0x7f46f5786d3e in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #2: + 0xc91879 (0x7f46f54be879 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so) frame #3: + 0xdc253 (0x7f47534b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6) frame #4: + 0x94ac3 (0x7f475f2a2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) frame #5: clone + 0x44 (0x7f475f333814 in /usr/lib/x86_64-linux-gnu/libc.so.6)

9aee1c01f3e7:6726:8015 [0] NCCL INFO [Service thread] Connection closed by localRank 1 9aee1c01f3e7:6729:8029 [3] NCCL INFO [Service thread] Connection closed by localRank 0 9aee1c01f3e7:6726:8080 [0] NCCL INFO [Service thread] Connection closed by localRank 1 9aee1c01f3e7:6729:8090 [3] NCCL INFO [Service thread] Connection closed by localRank 0 [2024-03-14 05:21:58,497] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 6726 closing signal SIGTERM [2024-03-14 05:21:58,497] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 6728 closing signal SIGTERM [2024-03-14 05:21:58,498] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 6729 closing signal SIGTERM [2024-03-14 05:21:58,663] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 6727) of binary: /usr/bin/python Traceback (most recent call last): File "/usr/local/bin/torchrun", line 33, in sys.exit(load_entry_point('torch==2.2.0a0+81ea7a4', 'console_scripts', 'torchrun')()) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 351, in wrapper return f(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main run(args) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in call return launch_agent(self._config, self._entrypoint, list(args)) File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError

Environment (please complete the following information):