FlagEmbedding

all_gather timeout

Open trillionmonster opened this issue 1 year ago • 2 comments

1%|▏ | 4817/360910 [38:19<22:40:36, 4.36it/s]
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=48128, NumelOut=192512, Timeout(ms)=1800000) ran for 1800644 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=49152, NumelOut=196608, Timeout(ms)=1800000) ran for 1800838 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6029, OpType=ALLREDUCE, NumelIn=65084417, NumelOut=65084417, Timeout(ms)=600000) ran for 1801622 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.

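What causes this problem? Can I avoid it by setting --negatives_cross_device False? Would negatives_cross_device False cause a large loss in model quality? Are there any better suggestions, for example somewhere to set a maximum number of negatives so that the timeout is avoided?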

trillionmonster, May 11 '24 03:05

Sorry, we cannot determine the cause of this issue from the information provided. You can try running the code a few more times to see whether it runs to completion.
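One way to get more information out of a log like the one above is to compare what each rank reported for the same collective. The sketch below is only an illustration: the helper names `parse_watchdog` and `mismatched_inputs` are hypothetical, the regular expression assumes the exact message layout shown in this thread, and the `LOG` string is an abridged copy of those lines. It groups the watchdog messages by sequence number and operation and lists the groups where ranks reported different `NumelIn` values.

```python
import re
from collections import defaultdict

# Assumes the ProcessGroupNCCL watchdog message layout seen in this thread.
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)

# Abridged copy of the watchdog messages posted in the issue.
LOG = """
[Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=48128, NumelOut=192512, Timeout(ms)=1800000) ran for 1800644 milliseconds before timing out.
[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6207, OpType=ALLGATHER, NumelIn=49152, NumelOut=196608, Timeout(ms)=1800000) ran for 1800838 milliseconds before timing out.
[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6029, OpType=ALLREDUCE, NumelIn=65084417, NumelOut=65084417, Timeout(ms)=600000) ran for 1801622 milliseconds before timing out.
[Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6206, OpType=ALLGATHER, NumelIn=8192, NumelOut=32768, Timeout(ms)=1800000) ran for 1801713 milliseconds before timing out.
"""


def parse_watchdog(log_text):
    """Return one record per distinct (rank, seq, op) timeout in log_text."""
    seen = {}
    for m in WATCHDOG_RE.finditer(log_text):
        rec = {
            "rank": int(m["rank"]),
            "seq": int(m["seq"]),
            "op": m["op"],
            "numel_in": int(m["numel_in"]),
            "numel_out": int(m["numel_out"]),
            "timeout_ms": int(m["timeout_ms"]),
        }
        # The same message can be logged twice; keep a single copy.
        seen[(rec["rank"], rec["seq"], rec["op"])] = rec
    return list(seen.values())


def mismatched_inputs(records):
    """Map (seq, op) -> {rank: NumelIn} for groups whose ranks disagree."""
    groups = defaultdict(dict)
    for r in records:
        groups[(r["seq"], r["op"])][r["rank"]] = r["numel_in"]
    return {k: v for k, v in groups.items() if len(set(v.values())) > 1}


if __name__ == "__main__":
    for (seq, op), per_rank in mismatched_inputs(parse_watchdog(LOG)).items():
        print(f"SeqNum={seq} {op}: NumelIn per rank = {per_rank}")
```

On the messages in this thread, ranks 0 and 2 report different input sizes for the ALLGATHER with SeqNum=6207, while rank 1 timed out on SeqNum=6206. That may mean the ranks are contributing differently shaped tensors or have drifted out of step with each other, but it is a hint to investigate rather than a diagnosis.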

staoxiao, May 11 '24 09:05

Have you solved this? I ran into the same problem.

zmtttt, Aug 29 '24 06:08