ChatLM-mini-Chinese

Some NCCL operations have failed or timed out.

Open dbcSep03 opened this issue 10 months ago • 6 comments

[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=61, OpType=_ALLGATHER_BASE, NumelIn=7168, NumelOut=14336, Timeout(ms)=1800000) ran for 1800086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7e2ae5781d87 in /home/dongbingcheng/anaconda3/envs/llmfinetuning/lib/python3.9/site-packages/torch/lib/libc10.so)

I'm training on two GPUs, and the error seems to appear right after the first epoch finishes. I'm using the provided train.py. Could it be that during evaluation some of the earlier processes haven't finished yet? Should an accelerator.wait_for_everyone() be added? Thanks for any help!
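For reference, a minimal sketch of where such a barrier could sit in an Accelerate-based loop. This is not the repository's actual train.py; the model, data, and epoch count below are placeholder stand-ins, only the position of accelerator.wait_for_everyone() matters:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder model and data, just to make the sketch runnable.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
train_loader = DataLoader(dataset, batch_size=8, shuffle=True)

model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    model.train()
    for inputs, labels in train_loader:
        logits = model(inputs)
        loss = loss_fn(logits, labels)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

    # Synchronize all ranks before evaluation/saving, so no process enters a
    # collective op (e.g. the _ALLGATHER_BASE seen in the log) while another
    # rank is still finishing the epoch; an unmatched collective is what
    # trips the 30-minute NCCL watchdog timeout.
    accelerator.wait_for_everyone()

    if accelerator.is_main_process:
        accelerator.print(f"epoch {epoch} finished on all ranks")
```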

dbcSep03 · Apr 18 '24