axolotl
[Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800115 milliseconds before timing out.
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Training of the Mixtral model should run to completion. Instead, it fails with the following NCCL timeout:
13%|█▎ | 3004/23865 [5:58:04<41:19:58, 7.13s/it]
[E ProcessGroupNCCL.cpp:475] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=419044352, NumelOut=419044352, Timeout(ms)=1800000) ran for 1800055 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800115 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=419044352, NumelOut=419044352, Timeout(ms)=1800000) ran for 1800055 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 7] NCCL watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800115 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20880, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800115 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800696 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800721 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20879, OpType=ALLREDUCE, NumelIn=406069248, NumelOut=406069248, Timeout(ms)=1800000) ran for 1800435 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20880, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800529 milliseconds before timing out.
Current behaviour
The training process breaks at 13%:
13%|█▎ | 3004/23865 [5:58:04<41:19:58,
Steps to reproduce
Train QLoRA with params of { rank=128, alpha=64, FA2=True, with layernorm and xentropy }
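Since the config yaml was not provided, the snippet below is only a hypothetical sketch of an equivalent QLoRA setup with the reported hyperparameters (rank=128, alpha=64, FlashAttention-2). The model name, target modules, and dropout are assumptions, and the fused layernorm/xentropy kernel patches that axolotl applies are not reproduced here.

```python
# Hypothetical stand-in for the missing config: a QLoRA setup mirroring the
# reported params (rank=128, alpha=64, FA2). Model name, target modules, and
# dropout are illustrative assumptions, not the reporter's actual values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",             # assumed base model
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",   # FA2=True
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=128,                                     # rank=128
    lora_alpha=64,                             # alpha=64
    lora_dropout=0.05,                         # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```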
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.10
axolotl branch-commit
main
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
I got a similar error doing a fine-tune of Mixtral:
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=8378, OpType=REDUCE, NumelIn=100700160, NumelOut=100700160, Timeout(ms)=1800000) ran for 1800848 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=8378, OpType=REDUCE, NumelIn=100700160, NumelOut=100700160, Timeout(ms)=1800000) ran for 1800848 milliseconds before timing out.
[2023-12-17 03:23:16,548] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1598 closing signal SIGTERM
[2023-12-17 03:23:17,983] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 1599) of binary: /home/vince/miniconda3/envs/distil/bin/python
Using DeepSpeed.
Could you see if there are any earlier errors?
There is no error before that. Basically it would train fine up to 40 steps and then say it timed out. I think it's something to do with communication between GPUs. Accelerate did this in the last release: "We now raise and try to disable P2P communications on consumer GPUs for the 3090 series and beyond. Without this users were seeing timeout issues and the like as NVIDIA dropped P2P support. If using accelerate launch we will automatically disable, and if we sense that it is still enabled on distributed setups using 3090's +, we will raise an error."
I tried a few different permutations of PyTorch 2+ and different versions of Accelerate; it did not change anything. Training here was done with DeepSpeed with CPU offloading of the optimizer, on two 4090s.
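If dropped P2P support on the 4090s is indeed the culprit, one thing to try is forcing P2P off for NCCL. Below is a minimal sketch using standard NCCL environment variables (generic NCCL/PyTorch settings, not taken from the axolotl docs); in practice they are usually exported in the shell before `accelerate launch` or `torchrun` so every rank sees them before the process group is created.

```python
# Minimal sketch, assuming broken P2P between consumer GPUs is the cause.
# NCCL_P2P_DISABLE, NCCL_IB_DISABLE and NCCL_DEBUG are standard NCCL
# environment variables; they must be set before torch.distributed
# initializes NCCL (exporting them in the launch shell is the usual route).
import os

os.environ.setdefault("NCCL_P2P_DISABLE", "1")  # route GPU-to-GPU traffic through host memory instead of P2P
os.environ.setdefault("NCCL_IB_DISABLE", "1")   # skip InfiniBand transports on machines without IB
os.environ.setdefault("NCCL_DEBUG", "INFO")     # log transport/ring selection to see where the hang starts
```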
Is this on runpod? Could you try the nccl doc? https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/docs/nccl.md
Nope, it's on Lambda Labs.
Could you give that doc a try either way?
Facing the same issue; is there any solution for this?
Any new updates? I am facing the same problem: the first 40 steps are fine and then the run collapses (note that no data exchange occurred via RDMA starting at step 40, and after ddp_timeout elapses, the training is taken down).
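For what it's worth, if the stall is only transient (for example a long validation or checkpoint step rather than genuinely dead communication), raising the collective timeout above the 1800 s default can at least keep the job alive. Below is a hedged sketch using knobs from transformers and Accelerate; how (or whether) axolotl exposes these in its yaml config is not shown here.

```python
# Sketch: raise the NCCL collective timeout above the 1800 s default.
# TrainingArguments.ddp_timeout (seconds) and Accelerate's InitProcessGroupKwargs
# are the underlying knobs in transformers/accelerate; how axolotl surfaces
# them in its yaml is not covered here.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs
from transformers import TrainingArguments

# Option 1: through the Hugging Face Trainer arguments.
args = TrainingArguments(output_dir="out", ddp_timeout=7200)

# Option 2: through Accelerate, when constructing the Accelerator yourself.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)
```

Note that a longer timeout only helps if the collective eventually completes; if RDMA traffic truly stops at step 40, the root cause still needs to be found (e.g. via NCCL_DEBUG=INFO logs).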
Hello @Ki6an @yuleiqin, have you tried the linked NCCL doc to see if it resolves the issue?