
[Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800115 milliseconds before timing out.

Open xiechengmude opened this issue 1 year ago • 9 comments

Please check that this issue hasn't been reported before.

  • [X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training of the Mixtral model proceeds without NCCL timeouts.

13%|█▎ | 3004/23865 [5:58:04<41:19:58, 7.13s/it]
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=419044352, NumelOut=419044352, Timeout(ms)=1800000) ran for 1800055 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800115 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=419044352, NumelOut=419044352, Timeout(ms)=1800000) ran for 1800055 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800115 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20880, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800115 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800696 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800721 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800435 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20880, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800529 milliseconds before timing out.

Current behaviour

The training process breaks at 13%:

13%|█▎ | 3004/23865 [5:58:04<41:19:58,

Steps to reproduce

Train QLoRA with params { rank=128, alpha=64, FA2=True, with the layernorm and xentropy kernels }.

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

  • [X] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this bug has not been reported yet.
  • [X] I am using the latest version of axolotl.
  • [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

xiechengmude avatar Dec 16 '23 09:12 xiechengmude

I got a similar error doing a finetune of mixtral:

[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=8378, OpType=REDUCE, NumelIn=100700160, NumelOut=100700160, Timeout(ms)=1800000) ran for 1800848 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=8378, OpType=REDUCE, NumelIn=100700160, NumelOut=100700160, Timeout(ms)=1800000) ran for 1800848 milliseconds before timing out.
[2023-12-17 03:23:16,548] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1598 closing signal SIGTERM
[2023-12-17 03:23:17,983] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 1599) of binary: /home/vince/miniconda3/envs/distil/bin/python

Using DeepSpeed.

amazingvince avatar Dec 17 '23 13:12 amazingvince

Could you check whether there are any earlier errors?
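
Not necessarily what the linked doc recommends, but a minimal sketch of how earlier NCCL failures can be surfaced before the watchdog fires, using the standard `NCCL_DEBUG` / `NCCL_DEBUG_SUBSYS` environment variables (the chosen values are assumptions about what is most useful here):

```python
# Minimal sketch: enable NCCL's own logging so transport/P2P problems show up
# before the 30-minute watchdog timeout. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are
# standard NCCL environment variables; they must be set before the process
# group is created (top of the training script, or exported in the shell that
# runs `accelerate launch`).
import os

os.environ["NCCL_DEBUG"] = "INFO"              # log init, topology, and transport selection
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"   # narrow the output to the init + network subsystems
```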

NanoCode012 avatar Dec 18 '23 05:12 NanoCode012

There is no error before that. Basically it would train fine up to 40 steps and then say it timed out. I think it's something to do with communication between GPUs. Accelerate did this in the last release: "We now raise and try to disable P2P communications on consumer GPUs for the 3090 series and beyond. Without this users were seeing timeout issues and the like as NVIDIA dropped P2P support. If using accelerate launch we will automatically disable, and if we sense that it is still enabled on distributed setups using 3090's +, we will raise an error."

I tried a few different permutations of PyTorch 2+ and different versions of Accelerate; nothing changed. Training here was done with DeepSpeed with CPU offloading of the optimizer, on 2× 4090s.
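
For reference, a minimal sketch of forcing that same P2P-disable behaviour by hand (my own illustration, not the Accelerate implementation), assuming the standard NCCL environment variables; it has to run before torch.distributed / Accelerate / DeepSpeed builds the NCCL process group:

```python
# Minimal sketch: disable NCCL peer-to-peer transfers (and, on a single-node
# workstation, InfiniBand), mirroring what newer Accelerate releases do
# automatically for RTX 3090/4090-class GPUs. Both are standard NCCL
# environment variables and must be set before NCCL initializes.
import os

os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # no direct GPU-to-GPU copies
os.environ.setdefault("NCCL_IB_DISABLE", "1")   # skip InfiniBand on a single box
```

Equivalently, the two variables can be exported in the shell before launching the training command.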

amazingvince avatar Dec 19 '23 02:12 amazingvince

Is this on runpod? Could you try the nccl doc? https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/nccl.md

NanoCode012 avatar Dec 19 '23 08:12 NanoCode012

Nope, it's on Lambda Labs.


xiechengmude avatar Dec 19 '23 08:12 xiechengmude

Could you give that doc a try either way?

NanoCode012 avatar Dec 19 '23 08:12 NanoCode012

Facing the same issue; is there any solution for this?

Ki6an avatar Dec 22 '23 04:12 Ki6an

Any new updates? I am facing the same problem: the first 40 steps are fine and then the run collapses (I notice no data exchange occurred via RDMA starting at step 40, and after ddp_timeout elapses the training is taken down).
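
In case it helps while the root cause is investigated, a minimal sketch of raising the collective timeout above the 1800 s default; `ddp_timeout` is the relevant `transformers.TrainingArguments` field, and the values here are illustrative assumptions, since a longer timeout only helps if the slow step eventually finishes:

```python
# Minimal sketch: lift the collective timeout from the 1800 s default.
# Values and output_dir are placeholders, not a recommendation.
from transformers import TrainingArguments

# Option A: via the HF Trainer arguments that axolotl builds on.
args = TrainingArguments(output_dir="out", ddp_timeout=7200)  # 2 hours

# Option B: if the process group is created by hand, pass the timeout directly:
#   import datetime, torch.distributed as dist
#   dist.init_process_group(backend="nccl",
#                           timeout=datetime.timedelta(seconds=7200))
```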

yuleiqin avatar Jan 21 '24 03:01 yuleiqin

Hello @Ki6an @yuleiqin, have you tried the linked NCCL doc, and did it solve the issue?

NanoCode012 avatar Mar 30 '24 18:03 NanoCode012