
Single-GPU training works, but multi-GPU training fails with an NCCL timeout

Open jacksonlee02365894 opened this issue 11 months ago • 2 comments

With the same dataset, single-GPU training runs fine, but multi-GPU training fails with the error below. I have already tried progressively reducing batch_size, which did not solve the problem.

[rank0]:[E ProcessGroupNCCL.cpp:563] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3437, OpType=ALLREDUCE, NumelIn=4311252, NumelOut=4311252, Timeout(ms)=600000) ran for 600057 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3437, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
[2024-12-18 16:44:15,305][root][INFO] - Validate epoch: 1, rank: 1

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 3437, last enqueued NCCL work: 3437, last completed NCCL work: 3436.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3437, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600076 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2a60b7a897 in /home/data/miniconda3/envs/asr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f2a148651b2 in /home/data/miniconda3/envs/asr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f2a14869fd0 in /home/data/miniconda3/envs/asr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f2a1486b31c in /home/data/miniconda3/envs/asr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f2a602c7bf4 in /home/data/miniconda3/envs/asr/bin/../lib/libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f2a61647ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7f2a616d8a04 in /lib/x86_64-linux-gnu/libc.so.6)
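For what it's worth, the two ranks report different tensor sizes (NumelIn=4311252 vs NumelIn=1) for the same collective SeqNum=3437, which may mean the ranks have desynchronized and are executing different collectives rather than one allreduce simply being slow. A minimal debugging sketch, assuming a standard PyTorch DDP launch (these are generic PyTorch/NCCL environment variables, not FunASR-specific settings, and not a confirmed fix):

```shell
# Set before launching training (e.g. before torchrun) to diagnose the hang.
export NCCL_DEBUG=INFO                 # log NCCL init, topology, and collective errors
export NCCL_BLOCKING_WAIT=1            # raise the error in the collective call itself
                                       # instead of via the async watchdog thread
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # check for mismatched collectives across ranks
```

The 600000 ms in the log is PyTorch's default per-collective timeout; it can also be raised via the `timeout` argument of `torch.distributed.init_process_group`, but if the ranks are truly desynchronized a longer timeout will only delay the failure.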

jacksonlee02365894 · Dec 18 '24 08:12