all_gather timeout
```
1%|▏ | 4817/360910 [38:19<22:40:36, 4.36it/s]
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=48128, NumelOut=192512, Timeout(ms)=1800000) ran for 1800644 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=49152, NumelOut=196608, Timeout(ms)=1800000) ran for 1800838 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6029, OpType=ALLREDUCE, NumelIn=65084417, NumelOut=65084417, Timeout(ms)=600000) ran for 1801622 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.
```
What could be causing this problem? Can I avoid it by setting --negatives_cross_device False? Would setting negatives_cross_device to False cause a large drop in model quality? Is there a better option, for example somewhere to set a maximum number of negatives so the timeout is avoided? (My rough understanding of what this option does is sketched below.)
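For context, this is my assumption of how cross-device negatives are usually gathered, not the exact FlagEmbedding code: every rank collects the embeddings of all other ranks with all_gather, which is the ALLGATHER collective the watchdog reports above, so a single stalled rank blocks everyone.

```python
# Rough sketch (assumption, not the actual FlagEmbedding implementation) of
# gathering embeddings across ranks so that in-batch negatives are shared
# between GPUs. One rank failing to reach this call stalls the collective
# until the NCCL watchdog timeout fires.
import torch
import torch.distributed as dist

def gather_embeddings_across_ranks(emb: torch.Tensor) -> torch.Tensor:
    """Concatenate `emb` (local_batch x dim) from all ranks into one tensor."""
    world_size = dist.get_world_size()
    buffers = [torch.zeros_like(emb) for _ in range(world_size)]
    dist.all_gather(buffers, emb.contiguous())
    # all_gather returns tensors without autograd history, so keep the local
    # shard as the original tensor to preserve its gradients.
    buffers[dist.get_rank()] = emb
    return torch.cat(buffers, dim=0)
```

If that understanding is right, disabling negatives_cross_device removes this collective and each rank only uses its own in-batch negatives, so it should avoid the timeout; how much quality is lost likely depends on the per-GPU batch size.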
Sorry, we cannot determine the cause of this issue from the information provided. You can try rerunning the code a few more times to see whether it completes successfully.
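Not an official fix, but if the stall is transient (for example one rank briefly blocked on data loading or checkpointing), a generic PyTorch-level workaround is to raise the NCCL collective timeout when the process group is created, assuming your launch path lets you pass this yourself; this is plain torch.distributed usage, not a FlagEmbedding flag:

```python
# Minimal sketch: raise the NCCL watchdog timeout from the 30-minute default
# (Timeout(ms)=1800000 in the log above) when creating the process group.
from datetime import timedelta

import torch.distributed as dist

dist.init_process_group(
    backend="nccl",
    timeout=timedelta(hours=3),  # more headroom before the watchdog fires
)
```

Note this only delays the failure if a rank is genuinely hung; the differing SeqNum and NumelIn values across ranks in the log may indicate the ranks fell out of sync, which would need to be debugged rather than timed out later.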
Have you solved this? I ran into the same problem.