`[Rank X] Watchdog caught collective operation timeout: WorkNCCL(...)` when using facebook-hydra multirun
System Info
Accelerate 0.10.0
Ubuntu 20.04.4 LTS
Python 3.9.12
NumPy 1.19.5
PyTorch 1.10.0
Hydra 1.2.0
Accelerate configs:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
use_cpu: false
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
- Wrap a training script with the Hydra decorator (a minimal sketch of such a script is shown below)
- Specify a dummy argument in your config.yaml file
- Run the script as `accelerate launch train.py -m dummy_argument=1,2,3`
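For concreteness, here is a minimal sketch of the kind of script this refers to. The model, training loop, and `dummy_argument` config key are illustrative placeholders (assuming a `config.yaml` next to the script), not the actual training code that produced the log below:

```python
# train.py -- illustrative sketch only; the real script is a custom training loop.
# Assumes a config.yaml next to this file containing, e.g.:
#   dummy_argument: 1
import hydra
import torch
from accelerate import Accelerator
from omegaconf import DictConfig


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    accelerator = Accelerator()

    # Placeholder model/optimizer; the issue does not depend on the architecture.
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    model, optimizer = accelerator.prepare(model, optimizer)

    for _ in range(100):
        x = torch.randn(8, 10, device=accelerator.device)
        loss = model(x).pow(2).mean()
        accelerator.backward(loss)   # DDP all-reduce happens here under MULTI_GPU
        optimizer.step()
        optimizer.zero_grad()

    accelerator.print(f"finished run with dummy_argument={cfg.dummy_argument}")


if __name__ == "__main__":
    main()
```

With `-m`, Hydra's default basic launcher runs the sweep jobs sequentially inside the same process, so each of the four `torchrun` workers executes all three jobs back to back; this would be consistent with the first run succeeding and the second one timing out, as described below.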
When doing this, the first run works just fine, whereas during the second one I get the following (note that this is from a custom training script):
[E ProcessGroupNCCL.cpp:719] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801371 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801462 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801580 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801580 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801371 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801462 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3397021 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 3397022) of binary: /home/rmiccini/.conda/envs/speakerid_training/bin/python
Traceback (most recent call last):
  File "/home/rmiccini/.conda/envs/speakerid_training/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
train.py FAILED
--------------------------------------------------------
Failures:
[1]:
time : 2022-07-28_01:41:02
host : ai01.local
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 3397023)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3397023
[2]:
time : 2022-07-28_01:41:02
host : ai01.local
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 3397024)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3397024
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-07-28_01:41:02
host : ai01.local
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 3397022)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3397022
========================================================
Traceback (most recent call last):
  File "/home/rmiccini/.conda/envs/speakerid_training/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/accelerate/commands/launch.py", line 562, in launch_command
    multi_gpu_launcher(args)
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/accelerate/commands/launch.py", line 306, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '4', 'train.py', '-m', 'augmentation=baseline,default,noise_reverb,noise_specaugment,noise,reverb_specaugment,reverb,specaugment', 'dataset=librispeech']' returned non-zero exit status 1.
Expected behavior
The multirun goes through successfully and each job is executed.
In other words, it would be great if both libraries could coexist in a project.
There is no reason Hydra shouldn't work with Accelerate, and the error message does not suggest there is any friction between them. Please use the forums to help debug your code (make sure to share your script so the community can help!) as we keep issues for bugs and feature requests only.
You can create a new post in the Accelerate section here, thanks! 🤗
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @miccio-dk, have you solved your problem? If so, could you please share your findings?
Hi, I couldn't resolve the issue, so I elected not to use multi-GPU training.
same error here
same