Megatron-LM [QUESTION] NCCL timeout error when running the second iteration

I am using one machine with 4 GPUs to run GPT-3. The first iteration runs without any errors, but training gets stuck at the second iteration (the second step) and then fails with the following errors:

[iteration] datetime: 2024-09-13 07:04:42
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=33, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 607565 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=257, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608700 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1796, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608843 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
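
For anyone triaging a log like this one, here is a minimal sketch that groups the watchdog lines by rank so it is easier to see which collectives timed out and by how much. The function name `summarize_timeouts` and the regular expression are my own illustration, not part of Megatron-LM or PyTorch, and the pattern assumes the exact message wording shown in the log above.

```python
import re
from collections import defaultdict

# Pattern for the ProcessGroupNCCL watchdog message, written against the
# wording in the log above (assumption: other PyTorch versions may differ).
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout>\d+)\) ran for (?P<elapsed>\d+) milliseconds"
)


def summarize_timeouts(log_text):
    """Hypothetical helper: map each rank to its timed-out collectives.

    Returns {rank: sorted list of (seq_num, op_type, timeout_ms, elapsed_ms)}.
    Repeated messages (e.g. the copy in the what() line) are de-duplicated.
    """
    per_rank = defaultdict(set)
    for m in WATCHDOG_RE.finditer(log_text):
        per_rank[int(m["rank"])].add(
            (int(m["seq"]), m["op"], int(m["timeout"]), int(m["elapsed"]))
        )
    return {rank: sorted(entries) for rank, entries in per_rank.items()}


# Watchdog lines copied from the log in this issue.
sample_log = "\n".join([
    "[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=33, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 607565 milliseconds before timing out.",
    "[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=257, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608700 milliseconds before timing out.",
    "[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.",
    "[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1796, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608843 milliseconds before timing out.",
    "what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.",
])

summary = summarize_timeouts(sample_log)
for rank in sorted(summary):
    for seq, op, timeout_ms, elapsed_ms in summary[rank]:
        overrun_s = (elapsed_ms - timeout_ms) / 1000
        print(f"rank {rank}: {op} SeqNum={seq} overran the timeout by {overrun_s:.1f}s")
```

In this log the timeout is 600000 ms (10 minutes), and every reported ALLREDUCE was cut off only a few seconds past it. That looks more like a collective that never completed than one that was merely slow, though the log alone does not prove it.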

Has anyone run into the same problem? Thanks a lot.

Sep 13 '24 08:09 zmtttt

Yep, same issue here. We were able to get pipeline parallelism working with a value of 2 on a single node, but with values above 2, or in multi-node settings, it doesn't work.

Sep 17 '24 15:09 jgcb00

> Yep, same issue here. We were able to get pipeline parallelism working with a value of 2 on a single node, but with values above 2, or in multi-node settings, it doesn't work.

I restarted the server and re-downloaded the model and dataset, and the issue was resolved.

Sep 18 '24 05:09 zmtttt

I pulled the latest master, and that solved the issue.

Sep 18 '24 14:09 jgcb00

Marking as stale. No activity in 60 days.

Nov 17 '24 18:11 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

Jul 31 '25 02:07 github-actions[bot]