
terminate called after throwing an instance of 'c10::DistBackendError'

Open wccccp opened this issue 7 months ago • 2 comments

```
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7fe76ab98969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7fe705ea04e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x222 (0x7fe705ea81a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7fe705ea9a0f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fe76a6b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fe7765d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fe77666aa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what(): [PG 1 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199, OpType=_REDUCE_SCATTER_BASE, NumelIn=7098994688, NumelOut=443687168, Timeout(ms)=600000) ran for 882725 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:567 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7fe76ab98969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7fe705ea04e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x222 (0x7fe705ea81a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7fe705ea9a0f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fe76a6b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fe7765d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fe77666aa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1450 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7fe76ab98969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0x10485fe (0x7fe705ed05fe in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xcbc925 (0x7fe705b44925 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7fe76a6b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7fe7765d9ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7fe77666aa04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
```
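The trace above shows the NCCL watchdog killing the process because a `_REDUCE_SCATTER_BASE` collective (SeqNum=199, roughly 7.1B input elements) exceeded the default 600000 ms timeout and was still running after ~882 s. If the collective is merely slow rather than genuinely hung, one workaround is to raise the process-group timeout. A minimal sketch, assuming a plain `torch.distributed` env:// launch (Megatron-LM builds its process groups on top of this; if the Megatron version in use exposes a `--distributed-timeout-minutes` argument, that flag serves the same purpose):

```python
# Sketch: raise the NCCL collective timeout so slow (but not hung) collectives
# are not aborted by the ProcessGroupNCCL watchdog after the default 10 minutes.
import os
from datetime import timedelta

import torch
import torch.distributed as dist


def init_distributed(timeout_minutes: int = 30) -> None:
    """Initialize NCCL with a longer watchdog timeout (assumes torchrun/env:// launch)."""
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",
        timeout=timedelta(minutes=timeout_minutes),  # default is 10 minutes (600000 ms)
    )


if __name__ == "__main__":
    init_distributed()
    if dist.get_rank() == 0:
        print("process group initialized with extended timeout")
```

Note that extending the timeout only helps if the collective eventually completes; if one rank never enters the reduce-scatter, the job will still hang, just for longer.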

```
[rank6]:[E719 09:25:40.733826925 ProcessGroupNCCL.cpp:1606] [PG 1 Rank 6] Timeout at NCCL work: 199, last enqueued NCCL work: 199, last completed NCCL work: 198.
[rank6]:[E719 09:25:40.733860630 ProcessGroupNCCL.cpp:579] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E719 09:25:40.733868816 ProcessGroupNCCL.cpp:585] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E719 09:25:40.734423949 ProcessGroupNCCL.cpp:1446] [PG 1 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199, OpType=_REDUCE_SCATTER_BASE, NumelIn=7098994688, NumelOut=443687168, Timeout(ms)=600000) ran for 882722 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:567 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7fb635d98969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7fb5d10a04e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x222 (0x7fb5d10a81a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7fb5d10a9a0f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fb6358b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb641802ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fb641893a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what(): [PG 1 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199, OpType=_REDUCE_SCATTER_BASE, NumelIn=7098994688, NumelOut=443687168, Timeout(ms)=600000) ran for 882722 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:567 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7fb635d98969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7fb5d10a04e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x222 (0x7fb5d10a81a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7fb5d10a9a0f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fb6358b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb641802ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fb641893a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1450 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7fb635d98969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0x10485fe (0x7fb5d10d05fe in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xcbc925 (0x7fb5d0d44925 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7fb6358b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7fb641802ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7fb641893a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
```
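Ranks 1, 6, and 7 all time out on the same SeqNum (199), which usually means one straggler or hung rank never entered the collective and the others waited until the watchdog fired. Verbose NCCL and c10d diagnostics typically narrow down which rank and device are stuck. A sketch of the relevant environment variables, set from Python before process-group initialization (variable names as in recent PyTorch 2.x builds; older releases use NCCL_ASYNC_ERROR_HANDLING instead of the TORCH_NCCL_ prefix):

```python
# Sketch: enable verbose NCCL / torch.distributed diagnostics.
# These must be set before the NCCL communicators are created,
# so configure the environment before calling init_process_group.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")                    # per-rank NCCL logging
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,COLL")        # focus on init and collective calls
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")     # extra c10d consistency/desync checks
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")  # surface NCCL errors instead of hanging

import torch.distributed as dist  # imported after the environment is configured
```

With these set, the per-rank NCCL INFO output (like the "Abort COMPLETE" lines below) can be correlated with the last completed work item (198 here) to identify which rank stalled before work 199.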

```
timaker-8b8djlxvfe-master-0:156:1120 [7] NCCL INFO comm 0x562232115d50 rank 7 nranks 16 cudaDev 7 busId d0000 - Abort COMPLETE
[rank7]:[E719 09:25:40.740359294 ProcessGroupNCCL.cpp:1606] [PG 1 Rank 7] Timeout at NCCL work: 199, last enqueued NCCL work: 199, last completed NCCL work: 198.
[rank7]:[E719 09:25:40.740413970 ProcessGroupNCCL.cpp:579] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E719 09:25:40.740434350 ProcessGroupNCCL.cpp:585] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E719 09:25:40.740575966 ProcessGroupNCCL.cpp:1446] [PG 1 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199, OpType=_REDUCE_SCATTER_BASE, NumelIn=7098994688, NumelOut=443687168, Timeout(ms)=600000) ran for 882725 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:567 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7ff9a0afb969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7ff93bea04e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x222 (0x7ff93bea81a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7ff93bea9a0f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7ff9a06b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ff9ac55dac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7ff9ac5eea04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what(): [PG 1 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199, OpType=_REDUCE_SCATTER_BASE, NumelIn=7098994688, NumelOut=443687168, Timeout(ms)=600000) ran for 882725 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:567 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7ff9a0afb969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7ff93bea04e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x222 (0x7ff93bea81a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7ff93bea9a0f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7ff9a06b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ff9ac55dac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7ff9ac5eea04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1450 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7ff9a0afb969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: + 0x10485fe (0x7ff93bed05fe in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xcbc925 (0x7ff93bb44925 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xdc253 (0x7ff9a06b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x94ac3 (0x7ff9ac55dac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7ff9ac5eea04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
```

```
timaker-8b8djlxvfe-master-0:154:1112 [5] NCCL INFO comm 0x563b1cff6180 rank 5 nranks 16 cudaDev 5 busId 98000 - Abort COMPLETE
timaker-8b8djlxvfe-master-0:152:1125 [3] NCCL INFO comm 0x55c0f86c8470 rank 3 nranks 16 cudaDev 3 busId 50000 - Abort COMPLETE
timaker-8b8djlxvfe-master-0:151:1143 [2] NCCL INFO comm 0x563f7888ec70 rank 2 nranks 16 cudaDev 2 busId 4a000 - Abort COMPLETE
[rank5]:[E719 09:25:40.746336670 ProcessGroupNCCL.cpp:1606] [PG 1 Rank 5] Timeout at NCCL work: 199, last enqueued NCCL work: 199, last completed NCCL work: 198.
[rank5]:[E719 09:25:40.746363262 ProcessGroupNCCL.cpp:579] [Rank 5] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E719 09:25:40.746373402 ProcessGroupNCCL.cpp:585] [Rank 5] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E719 09:25:40.746461013 ProcessGroupNCCL.cpp:1446] [PG 1 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199, OpType=_REDUCE_SCATTER_BASE, NumelIn=7098994688, NumelOut=443687168, Timeout(ms)=600000) ran for 882737 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:567 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7fb2d6398969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7fb2716a04e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x222 (0x7fb2716a81a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7fb2716a9a0f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fb2d5eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb2e1e42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fb2e1ed3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what(): [PG 1 Rank 5] Process group watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=199, OpType=_REDUCE_SCATTER_BASE, NumelIn=7098994688, NumelOut=443687168, Timeout(ms)=600000) ran for 882737 milliseconds before timing out.
Exception raised from checkTimeout at /opt/pytorch/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:567 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x99 (0x7fb2d6398969 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e1 (0x7fb2716a04e1 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x222 (0x7fb2716a81a2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10f (0x7fb2716a9a0f in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdc253 (0x7fb2d5eb0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb2e1e42ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7fb2e1ed3a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
```
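The watchdog only reports the ranks that waited; it does not name the rank that never reached the collective. One way to localize the straggler is a cheap gloo-based barrier around the suspected step, since `monitored_barrier` makes rank 0 report exactly which ranks failed to check in. A minimal sketch (the side group, the tag, and the 2-minute timeout are illustrative choices, not something Megatron-LM does by itself):

```python
# Sketch: use a gloo side group to identify which rank is the straggler.
# monitored_barrier is only supported on the gloo backend, so a separate
# CPU process group is created alongside the NCCL one.
from datetime import timedelta

import torch.distributed as dist


def make_debug_group():
    """Create a gloo group spanning all ranks for monitored_barrier checks."""
    return dist.new_group(backend="gloo")


def checkpoint_barrier(gloo_group, tag: str, timeout_minutes: int = 2) -> None:
    """Rank 0 raises an error naming the missing ranks if any rank fails to arrive."""
    try:
        dist.monitored_barrier(group=gloo_group, timeout=timedelta(minutes=timeout_minutes))
    except RuntimeError as err:
        print(f"[{tag}] barrier failed on rank {dist.get_rank()}: {err}")
        raise


# Usage (inside the training loop, before the gradient reduce-scatter):
# gloo_pg = make_debug_group()
# checkpoint_barrier(gloo_pg, tag="before-grad-reduce-scatter")
```

Common causes of a single rank falling this far behind include uneven data loading or checkpoint I/O on one node, a throttled or failing GPU, and host-side swapping; the barrier above narrows the search to a specific rank so those can be checked directly.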

wccccp avatar Jul 19 '24 09:07 wccccp