`[Rank X] Watchdog caught collective operation timeout: WorkNCCL(...)` when using facebook-hydra multirun
System Info
Accelerate 0.10.0
Ubuntu 20.04.4 LTS
Python 3.9.12
NumPy 1.19.5
PyTorch 1.10.0
Hydra 1.2.0
Accelerate configs:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MULTI_GPU
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
use_cpu: false
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
- Wrap a training script with the Hydra decorator (a minimal sketch of such a script is shown below)
- Specify a dummy argument in your config.yaml file
- Run the script as `accelerate launch train.py -m dummy_argument=1,2,3`
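For concreteness, here is a minimal sketch of the kind of script this refers to. The model, training loop, and `dummy_argument` config key are illustrative placeholders (assuming a `config.yaml` next to the script), not the actual training code that produced the log below:

```python
# train.py -- illustrative sketch only; the real script is a custom training loop.
# Assumes a config.yaml next to this file containing, e.g.:
#   dummy_argument: 1
import hydra
import torch
from accelerate import Accelerator
from omegaconf import DictConfig


@hydra.main(config_path=".", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    accelerator = Accelerator()

    # Placeholder model/optimizer; the issue does not depend on the architecture.
    model = torch.nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    model, optimizer = accelerator.prepare(model, optimizer)

    for _ in range(100):
        x = torch.randn(8, 10, device=accelerator.device)
        loss = model(x).pow(2).mean()
        accelerator.backward(loss)   # DDP all-reduce happens here under MULTI_GPU
        optimizer.step()
        optimizer.zero_grad()

    accelerator.print(f"finished run with dummy_argument={cfg.dummy_argument}")


if __name__ == "__main__":
    main()
```

With `-m`, Hydra's default basic launcher runs the sweep jobs sequentially inside the same process, so each of the four `torchrun` workers executes all three jobs back to back; this would be consistent with the first run succeeding and the second one timing out, as described below.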
When doing this, the first run works just fine, whereas during the second one I get the following (note that this is from a custom training script):
[E ProcessGroupNCCL.cpp:719] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801371 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801462 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801580 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801580 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801371 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16246, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801462 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3397021 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 3397022) of binary: /home/rmiccini/.conda/envs/speakerid_training/bin/python
Traceback (most recent call last):
  File "/home/rmiccini/.conda/envs/speakerid_training/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
train.py FAILED
--------------------------------------------------------
Failures:
[1]:
time : 2022-07-28_01:41:02
host : ai01.local
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 3397023)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3397023
[2]:
time : 2022-07-28_01:41:02
host : ai01.local
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 3397024)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3397024
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2022-07-28_01:41:02
host : ai01.local
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 3397022)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3397022
========================================================
Traceback (most recent call last):
  File "/home/rmiccini/.conda/envs/speakerid_training/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/accelerate/commands/launch.py", line 562, in launch_command
    multi_gpu_launcher(args)
  File "/home/rmiccini/.conda/envs/speakerid_training/lib/python3.9/site-packages/accelerate/commands/launch.py", line 306, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['torchrun', '--nproc_per_node', '4', 'train.py', '-m', 'augmentation=baseline,default,noise_reverb,noise_specaugment,noise,reverb_specaugment,reverb,specaugment', 'dataset=librispeech']' returned non-zero exit status 1.
Expected behavior
The multirun goes through successfully and each job is executed.
In other words, it would be great if both libraries could coexist in a project.
There is no reason Hydra shouldn't work with Accelerate, and the error message does not suggest there is any friction between them. Please use the forums to help debug your code (make sure to share your script so the community can help!) as we keep issues for bugs and feature requests only.
You can create a new post in the Accelerate section here, thanks! 🤗
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @miccio-dk, have you solved your problem? If so, could you please share your findings?
Hi, I couldn't resolve the issue, so I elected not to use multi-GPU training.
same error here
same