Megatron-LM
[BUG] NCCL timeout (maybe ALLREDUCE?)
When I use Megatron Core to train a MoE model, I get the following error:
Output info:
[rank2]:[E ProcessGroupNCCL.cpp:754] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600381 milliseconds before timing out.
9aee1c01f3e7:6727:7168 [1] NCCL INFO [Service thread] Connection closed by localRank 2
9aee1c01f3e7:6728:7169 [2] NCCL INFO [Service thread] Connection closed by localRank 2
9aee1c01f3e7:6729:7171 [3] NCCL INFO [Service thread] Connection closed by localRank 2
[rank0]:[E ProcessGroupNCCL.cpp:754] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600528 milliseconds before timing out.
9aee1c01f3e7:6727:7168 [1] NCCL INFO [Service thread] Connection closed by localRank 0
9aee1c01f3e7:6726:7170 [0] NCCL INFO [Service thread] Connection closed by localRank 0
9aee1c01f3e7:6729:7171 [3] NCCL INFO [Service thread] Connection closed by localRank 0
[rank1]:[E ProcessGroupNCCL.cpp:754] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600916 milliseconds before timing out.
9aee1c01f3e7:6726:7170 [0] NCCL INFO [Service thread] Connection closed by localRank 1
9aee1c01f3e7:6727:7168 [1] NCCL INFO [Service thread] Connection closed by localRank 1
9aee1c01f3e7:6728:7169 [2] NCCL INFO [Service thread] Connection closed by localRank 1
9aee1c01f3e7:6727:7025 [1] NCCL INFO comm 0x56490665efb0 rank 1 nranks 4 cudaDev 1 busId 5a000 - Abort COMPLETE
[rank1]:[E ProcessGroupNCCL.cpp:768] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:774] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1282] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600916 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:756 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=305, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600916 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:756 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1286 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits
9aee1c01f3e7:6726:8015 [0] NCCL INFO [Service thread] Connection closed by localRank 1
9aee1c01f3e7:6729:8029 [3] NCCL INFO [Service thread] Connection closed by localRank 0
9aee1c01f3e7:6726:8080 [0] NCCL INFO [Service thread] Connection closed by localRank 1
9aee1c01f3e7:6729:8090 [3] NCCL INFO [Service thread] Connection closed by localRank 0
[2024-03-14 05:21:58,497] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 6726 closing signal SIGTERM
[2024-03-14 05:21:58,497] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 6728 closing signal SIGTERM
[2024-03-14 05:21:58,498] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 6729 closing signal SIGTERM
[2024-03-14 05:21:58,663] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 6727) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in
Environment (please complete the following information):
- Megatron-LM: commit 8957468
- Python version: 3.10.12
- CUDA version: 12.3
- NCCL: the version bundled with the CUDA/PyTorch container
If you can answer my question, I would be very grateful.
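In case it helps while the root cause is being investigated: the Timeout(ms)=600000 in the watchdog message is PyTorch's default 10-minute limit for NCCL collectives. Below is a minimal sketch (not Megatron's actual init code, and assuming a plain torchrun launch) of raising that limit so a slow or load-imbalanced MoE step is not killed by the watchdog while you debug; recent Megatron-LM checkouts also expose a --distributed-timeout-minutes argument that does the same thing, if your version has it.

```python
from datetime import timedelta

import torch.distributed as dist

# Sketch only: raise the collective timeout above the 10-minute default
# (600000 ms, the value shown in the watchdog message) so a slow or
# load-imbalanced MoE step isn't killed while you look for the real hang.
# Rank/world size are taken from the environment set up by torchrun.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=60),
)
```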
Can confirm the same issue
Marking as stale. No activity in 60 days.
I have also encountered this problem. May I ask whether it has been resolved, and if so, how?
I have encountered this problem too.
Same issue, any update? How can we obtain more detailed debug info?
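Regarding more detailed debug info, here is a hedged sketch of the environment variables that typically help when chasing NCCL timeouts. They need to be set before the process group is created, either exported in the shell before torchrun or very early in the training script:

```python
import os

# Sketch only: verbose logging knobs for debugging NCCL timeouts/hangs.
# These must be in the environment before torch.distributed.init_process_group() runs.
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # detailed NCCL logs per rank
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,COLL")     # limit output to init + collectives
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra c10d consistency checks
```

Note that the extra logging is per rank and can be very verbose, so it is usually only worth enabling while reproducing the hang.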
My bug was solved by this PR: https://github.com/NVIDIA/TransformerEngine/pull/1031. Hope this helps you.
Marking as stale. No activity in 60 days.