
NCCL timeout occurs in a multi-GPU task.

Open QingshuiL opened this issue 10 months ago • 0 comments

When I run multi-GPU evaluation for DeepSeekv3, the following error message is displayed:

2025-02-21 16:29:40.579 | INFO | llmc.eval.eval_base:init:21 - eval_cfg : {'eval_pos': ['pretrain', 'transformed', 'fake_quant'], 'name': 'wikitext2', 'download': True, 'path': 'eval data path', 'seq_len': 2048, 'bs': 2, 'inference_per_block': True, 'type': 'ppl'}
Token indices sequence length is longer than the specified maximum sequence length for this model (288925 > 131072). Running this sequence through the model will result in indexing errors
2025-02-21 16:29:59.070 | INFO | llmc.eval.eval_ppl:eval_func:25 - index : 0/70
[rank1]:[E221 16:42:45.196832485 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600008 milliseconds before timing out.
[rank1]:[E221 16:42:45.206053793 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank2]:[E221 16:42:45.303973140 ProcessGroupNCCL.cpp:607] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600055 milliseconds before timing out.
[rank2]:[E221 16:42:45.304506239 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank2]:[E221 16:42:50.489321169 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 2] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank2]:[E221 16:42:50.489359169 ProcessGroupNCCL.cpp:621] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E221 16:42:50.489367329 ProcessGroupNCCL.cpp:627] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E221 16:42:50.495930731 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600055 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb829e81f86 in

frame #5: + 0x8609 (0x7fb87e24c609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb87e017353 in /lib/x86_64-linux-gnu/libc.so.6)

However, the evaluation runs fine on a single GPU. How can I solve this NCCL timeout problem?
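One thing I am considering is raising the NCCL collective timeout when the process group is initialized. Below is a minimal sketch, assuming the evaluation path initializes the group via torch.distributed.init_process_group with the NCCL backend (the actual initialization inside llmc may differ):

```python
import datetime

import torch.distributed as dist

# Sketch only: raise the collective timeout from the default 10 minutes
# (the 600000 ms seen in the log above) to 2 hours so that slow per-block
# inference on a large model does not trip the NCCL watchdog.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```

Running with NCCL_DEBUG=INFO might also help show which collective is hanging, but I am not sure whether the timeout is the root cause here.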

QingshuiL · Feb 21 '25 08:02