
NCCL timeout

[Open] CriDora opened this issue 10 months ago · 2 comments

When running the AR experiment of VALLE_V2, the following error occurred. I have tried many solutions found online, but none of them worked. Have you encountered this problem before?

Training Epoch 0:  17%|█▋        | 4002/23438 [19:18<1:21:05, 4.00batch/s]
Training Epoch 0:  17%|█▋        | 4003/23438 [19:18<1:21:47, 3.96batch/s]
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19022, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800994 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19022, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801003 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19022, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801005 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19022, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801005 milliseconds before timing out.
Saving state to /mnt/workspace/liuhw/Amphion/ckpt/VALLE_V2_wavtokenizer/wavtokenizer_large75_lr1e-4_8layer_3s-15s_libritts_180H_step80w/checkpoint/epoch-0000_step-0001000_loss-8.328300...
2025-02-04 18:25:30 | INFO | accelerate.accelerator | Saving current state to /mnt/workspace/liuhw/Amphion/ckpt/VALLE_V2_wavtokenizer/wavtokenizer_large75_lr1e-4_8layer_3s-15s_libritts_180H_step80w/checkpoint/epoch-0000_step-0001000_loss-8.328300
server17:3881907:3881998 [0] NCCL INFO comm 0x498bd450 rank 0 nranks 4 cudaDev 0 busId 1e000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
server17:3881910:3882004 [0] NCCL INFO comm 0x490369c0 rank 3 nranks 4 cudaDev 3 busId 3f000 - Abort COMPLETE
server17:3881908:3882007 [0] NCCL INFO comm 0x46d3ff60 rank 1 nranks 4 cudaDev 1 busId 3d000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
server17:3881909:3882001 [0] NCCL INFO comm 0x47fce560 rank 2 nranks 4 cudaDev 2 busId 3e000 - Abort COMPLETE
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
Traceback (most recent call last):
  File "/mnt/workspace/liuhw/Amphion/.//bins/tts/train.py", line 156, in <module>
    main()
  File "/mnt/workspace/liuhw/Amphion/.//bins/tts/train.py", line 152, in main
    trainer.train_loop()
  File "/mnt/workspace/liuhw/Amphion/models/tts/valle_v2_wavtokenizer/base_trainer.py", line 321, in train_loop
    train_loss = self._train_epoch()
  File "/mnt/workspace/liuhw/Amphion/models/tts/valle_v2_wavtokenizer/base_trainer.py", line 402, in _train_epoch
    loss = self._train_step(batch)
  File "/mnt/workspace/liuhw/Amphion/models/tts/valle_v2_wavtokenizer/valle_ar_trainer.py", line 199, in _train_step
    out = self.model(
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1148, in forward
    self._sync_buffers()
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1748, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1752, in _sync_module_buffers
    self._default_broadcast_coalesced(
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1775, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1689, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=19022, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801005 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 3881907) of binary: /mnt/workspace/liuhw/miniconda/envs/Amphion/bin/python
Traceback (most recent call last):
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/workspace/liuhw/miniconda/envs/Amphion/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

.//bins/tts/train.py FAILED

Failures:
[1]:
  time      : 2025-02-04_18:25:32
  host      : server17
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 3881908)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3881908
[2]:
  time      : 2025-02-04_18:25:32
  host      : server17
  rank      : 2 (local_rank: 2)
  exitcode  : -6 (pid: 3881909)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3881909
[3]:
  time      : 2025-02-04_18:25:32
  host      : server17
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 3881910)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3881910

Root Cause (first observed failure):
[0]:
  time      : 2025-02-04_18:25:32
  host      : server17
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 3881907)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 3881907

CriDora · Feb 12 '25 06:02

Hi @CriDora, thanks for using our code! In my experience, the NCCL timeout problem is usually caused by an issue in dataset loading, e.g. a specific data file that takes too long to load. You could add a timeout to the data loading; another option is to update the accelerate package: pip install -U accelerate.
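For reference, a rough sketch of both ideas (not Amphion's actual trainer code; the dummy dataset and the timeout values below are only placeholders):

from datetime import timedelta

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Stand-in dataset; in practice this would be the real training dataset.
dummy_dataset = TensorDataset(torch.zeros(8, 10))

# 1) Make the DataLoader fail fast instead of blocking forever on a slow file:
#    `timeout` is in seconds and only takes effect when num_workers > 0.
train_loader = DataLoader(
    dummy_dataset,
    batch_size=1,
    num_workers=4,
    timeout=600,  # raise an error if no batch arrives within 10 minutes
)

# 2) Give NCCL collectives more headroom than the default 30 minutes,
#    so one slow rank does not trip the watchdog.
accelerator = Accelerator(
    kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
)

Either change should at least make the failure surface earlier and point to the slow rank, rather than aborting after the 30-minute watchdog.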

jiaqili3 · Feb 12 '25 11:02


Thank you for your answer. I tried the method you mentioned, but it still didn't solve the problem. The error occurs when saving checkpoints, and only in multi-GPU training; single-GPU training works fine. Due to hardware limitations, the batch size on each card is set to 1. Is it possible that GPU memory is insufficient?
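In case it helps narrow this down, here is a small hypothetical snippet for logging GPU memory around the checkpoint save (the accelerator variable and the placement are assumptions, not the actual Amphion trainer code):

import torch

def log_gpu_memory(tag: str) -> None:
    # Print allocated / reserved / peak memory for the current CUDA device.
    if not torch.cuda.is_available():
        return
    dev = torch.cuda.current_device()
    gib = 1024 ** 3
    print(
        f"[{tag}] cuda:{dev} "
        f"allocated={torch.cuda.memory_allocated(dev) / gib:.2f} GiB, "
        f"reserved={torch.cuda.memory_reserved(dev) / gib:.2f} GiB, "
        f"peak={torch.cuda.max_memory_allocated(dev) / gib:.2f} GiB"
    )

# Hypothetical placement around the save call in the trainer:
# log_gpu_memory("before save")
# accelerator.save_state(checkpoint_dir)
# log_gpu_memory("after save")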

CriDora · Feb 17 '25 06:02