[QUESTION] NCCL timeout error when running the second iteration
I am using one machine with 4 GPUs to run GPT-3. The first iteration runs without any errors, but the second iteration gets stuck at its second step and fails with the following errors:
[iteration] datetime: 2024-09-13 07:04:42
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=33, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 607565 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=257, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608700 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1796, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608843 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
Has anyone run into the same problem? Thanks a lot.
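For anyone triaging a similar log, here is a minimal sketch that groups the watchdog timeout lines by rank, so it is easier to see which ranks timed out and at which sequence numbers. The function name `summarize_timeouts` and the regular expression are illustrative assumptions written against the log format shown above; they are not part of PyTorch or NCCL, and the pattern may need adjusting for other PyTorch versions.

```python
import re
from collections import defaultdict

# Assumed pattern, derived only from the watchdog lines quoted in this issue.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"Timeout\(ms\)=(?P<timeout>\d+)\) ran for (?P<elapsed>\d+) milliseconds"
)


def summarize_timeouts(log_text):
    """Return {rank: [(seq_num, op_type, elapsed_ms), ...]} for timed-out collectives.

    Duplicate entries (for example the repeated line after `what():`) are kept once.
    """
    by_rank = defaultdict(list)
    for m in WATCHDOG_RE.finditer(log_text):
        entry = (int(m["seq"]), m["op"], int(m["elapsed"]))
        if entry not in by_rank[int(m["rank"])]:
            by_rank[int(m["rank"])].append(entry)
    return dict(by_rank)


# Excerpt of the log from this issue, including the repeated `what():` line.
SAMPLE_LOG = """
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=33, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 607565 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=257, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608700 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1796, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608843 milliseconds before timing out.
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1032, OpType=ALLREDUCE, Timeout(ms)=600000) ran for 608832 milliseconds before timing out.
"""

if __name__ == "__main__":
    for rank, entries in sorted(summarize_timeouts(SAMPLE_LOG).items()):
        for seq, op, elapsed_ms in entries:
            print(f"rank {rank}: {op} SeqNum={seq} ran {elapsed_ms} ms")
```

On the excerpt above, only ranks 0 and 3 report timeouts, and rank 0 reports three different sequence numbers. That could mean several process groups are stalled at once, but this summary only organizes the log and does not diagnose the cause.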
Yep, same issue here. We were able to get pipeline parallelism working with a value of 2 on a single node, but with values beyond 2, and in multi-node settings, it doesn't work.
I restarted the server and re-downloaded the model and dataset, and the issue was resolved.
I pulled the latest master, and that solved the issue.
Marking as stale. No activity in 60 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.