LMOps Timeout Error in all_gather during evaluate_ppo() on 2 H100 GPUs with miniLLM and ZeRO

Hi, I'm running MiniLLM on 2 H100 GPUs on a single node, using ZeRO with optimizer and parameter offload. After the generation evaluation, I get a timeout during the all_gather step.

Generation Evaluation: 100%|█████████▉| 497/499 [18:29:58<05:20, 160.10s/it]
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.

I've tried increasing the timeout period without success. Are there any other configurations or steps I can take to resolve this timeout issue?

Thank you for your help!

Dec 12 '23 11:12 Ispanicus

Have you tried A100s or V100s? I am unsure whether the above error only appears with H100s.

Dec 12 '23 12:12 donglixp

Unfortunately, I only have access to 2 H100s. The hardware could be the issue, since H100s run on CUDA sm_90, but I wouldn't know where to begin debugging that.

Dec 12 '23 13:12 Ispanicus
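
One place to begin, independent of the GPU model, is the watchdog output quoted in the original post. The sketch below parses those lines and compares which collective each rank was waiting in. The helper names (`stuck_collectives`, `ranks_disagree`) are hypothetical and are not part of LMOps, PyTorch, or NCCL; the regular expression assumes the exact `ProcessGroupNCCL.cpp` line format shown in this thread.

```python
import re
from datetime import timedelta

# Assumes the watchdog line format quoted in this thread; other PyTorch
# versions may word the message differently.
WATCHDOG = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\)"
)


def stuck_collectives(log_text):
    """Map each rank to the first collective it reported timing out in."""
    stuck = {}
    for m in WATCHDOG.finditer(log_text):
        # The same message is repeated several times per rank; keep the first.
        stuck.setdefault(int(m["rank"]), {
            "seq": int(m["seq"]),
            "op": m["op"],
            "numel_in": int(m["numel_in"]),
            "numel_out": int(m["numel_out"]),
            "timeout": timedelta(milliseconds=int(m["timeout_ms"])),
        })
    return stuck


def ranks_disagree(stuck):
    """True if the ranks were not all waiting in the same collective call."""
    seen = {(c["seq"], c["op"], c["numel_in"]) for c in stuck.values()}
    return len(seen) > 1


# Two lines copied from the log in the original post.
LOG = """
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134630, OpType=ALLGATHER, NumelIn=499, NumelOut=998, Timeout(ms)=18000000) ran for 18000109 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=22134629, OpType=_ALLGATHER_BASE, NumelIn=65536000, NumelOut=131072000, Timeout(ms)=18000000) ran for 18000929 milliseconds before timing out.
"""

stuck = stuck_collectives(LOG)
for rank in sorted(stuck):
    c = stuck[rank]
    print(rank, c["op"], c["seq"], c["numel_in"], c["timeout"])
print("ranks disagree:", ranks_disagree(stuck))
```

On the quoted lines, Rank 1 was waiting in a 499-element ALLGATHER (SeqNum 22134630) while Rank 0 was still in a 65536000-element _ALLGATHER_BASE (SeqNum 22134629), and both hit the configured 5-hour timeout. Ranks waiting in different collectives often indicates that they fell out of step rather than that the link was slow, which would be consistent with a longer timeout not helping. That is an inference from the log, not a confirmed diagnosis.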
