
Running into a torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed.

Open vsoesanto opened this issue 8 months ago • 4 comments

Hi,

I'd like to start by expressing my gratitude for torchtune: it gives me a high level of abstraction for experimenting with various post-training strategies.

However, in my experiments I'm running into an error. I'm trying to fine-tune a Phi-4 model with the lora_finetune_distributed recipe, using a custom dataset in the alpaca-cleaned format. I ran this on 4 GPUs (the config YAML recommends 2+ GPUs).

The command I ran is the following: tune run --nproc_per_node 4 lora_finetune_distributed --config <custom_14B_lora_config.yaml>

Here is the config:

output_dir: /home/users/vincent/torchtune/phi-4/14B_lora_V15 

# Model arguments
model:
  _component_: torchtune.models.phi4.lora_phi4_14b
  lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: False
  lora_rank: 8  # higher increases accuracy and memory
  lora_alpha: 16  # usually alpha=2*rank
  lora_dropout: 0.0

# Tokenizer
tokenizer:
  _component_: torchtune.models.phi4.phi4_tokenizer
  vocab_path: /home/users/vincent/base_models/phi-4/vocab.json
  merges_path: /home/users/vincent/base_models/phi-4/merges.txt
  max_seq_len: null

# Checkpointer
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /home/users/vincent/base_models/phi-4
  checkpoint_files: [
    model-00001-of-00006.safetensors,
    model-00002-of-00006.safetensors,
    model-00003-of-00006.safetensors,
    model-00004-of-00006.safetensors,
    model-00005-of-00006.safetensors,
    model-00006-of-00006.safetensors,
  ]
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: PHI4
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  source: json
  data_files: /home/users/vincent/train.json
  column_map:
    instruction: instruction
    input: input
    output: output
  train_on_input: False
  packed: False
  split: train
seed: 42
shuffle: True

# Fine-tuning arguments
epochs: 15
max_steps_per_epoch: null
batch_size: 2
gradient_accumulation_steps: 8  # Use to increase effective batch size
optimizer:
  _component_: torch.optim.AdamW
  fused: True
  weight_decay: 0.01
  lr: 3e-4
lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss
compile: False  # torch.compile the model + loss, True increases speed + decreases memory

# Training env
device: cuda

# Memory management
enable_activation_checkpointing: True  # True reduces memory
enable_activation_offloading: True  # True reduces memory
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: True


# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False

  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs

  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True

  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False

  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1
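
In case it helps with reproduction: the custom JSON simply contains records with the alpaca-cleaned columns referenced by column_map above. Here is a minimal, torchtune-free way to sanity-check the file (standard library only; the path is my data_files value):

import json

# Sanity check for the custom dataset: the alpaca-style JSON is assumed to be
# a list of records, each exposing the columns mapped in the config above.
REQUIRED = {"instruction", "input", "output"}

with open("/home/users/vincent/train.json") as f:
    records = json.load(f)

missing = [i for i, rec in enumerate(records) if not REQUIRED.issubset(rec)]
print(f"{len(records)} records, {len(missing)} missing required columns")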

The error happens when an epoch completes and the checkpoint is being saved; the trace below is from the end of the first epoch. (My reading of where it hangs is sketched right after the trace.)

The full error trace is as follows:

...
INFO:torchtune.utils._logging:Saving checkpoint. This may take some time. Retrieving full model state dict...                            
INFO:torchtune.utils._logging:Getting full model state dict took 19.69 secs                                                                
INFO:torchtune.utils._logging:Retrieving optimizer state dict...                                                                           
INFO:torchtune.utils._logging:Getting optimizer state dict took 20.07 secs 
[rank3]: Traceback (most recent call last):                                                                                                                 
[rank3]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 933, in <module>                     
[rank3]:     sys.exit(recipe_main())                                                                                                                        
[rank3]:   File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper                                 
[rank3]:     sys.exit(recipe_main(conf))                                                                                                                    
[rank3]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 928, in recipe_main                  
[rank3]:     recipe.train()                                                                                                                                 
[rank3]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 894, in train                        
[rank3]:     self.save_checkpoint(epoch=curr_epoch)                                                                                                         
[rank3]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 745, in save_checkpoint              
[rank3]:     torch.distributed.barrier()                                                                                                                    
[rank3]:   File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper                           
[rank3]:     return func(*args, **kwargs)                                                                                                                   
[rank3]:   File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4556, in barrier                    
[rank3]:     work.wait()               
[rank3]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 1800000ms for send operation to complete
[rank1]: Traceback (most recent call last):                                   
[rank1]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 933, in <module>
[rank1]:     sys.exit(recipe_main())                                          
[rank1]:   File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank1]:     sys.exit(recipe_main(conf))                                      
[rank1]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 928, in recipe_main
[rank1]:     recipe.train()            
[rank1]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 894, in train
[rank1]:     self.save_checkpoint(epoch=curr_epoch)                           
[rank1]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 745, in save_checkpoint              
[rank1]:     torch.distributed.barrier()                                      
[rank1]:   File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]:     return func(*args, **kwargs)                                     
[rank1]:   File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4556, in barrier
[rank1]:     work.wait()               
[rank1]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[rank2]: Traceback (most recent call last):                                   
[rank2]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 933, in <module>
[rank2]:     sys.exit(recipe_main())                                          
[rank2]:   File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank2]:     sys.exit(recipe_main(conf))                                      
[rank2]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 928, in recipe_main
[rank2]:     recipe.train()            
[rank2]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 894, in train
[rank2]:     self.save_checkpoint(epoch=curr_epoch)                           
[rank2]:   File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 745, in save_checkpoint              
[rank2]:     torch.distributed.barrier()                                      
[rank2]:   File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank2]:     return func(*args, **kwargs)                                     
[rank2]:   File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4556, in barrier
[rank2]:     work.wait()               
[rank2]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[W503 01:52:07.597517393 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3                                               
[W503 01:48:33.433175032 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=54, addr=[::2a72:0:b80a:e7e0:5f5a:0]:60386, remote=[localhost]:38081) returned 0, likely a timeout
[W503 01:52:54.119670007 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W503 01:52:26.348969241 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=54, addr=[4015:e2eb:fd7f:0:b80a:e7e0:5f5a:0]:60386, remote=[localhost]:38081) timed out after 60000ms
W0503 02:31:14.231000 880626 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 880960 closing signal SIGTERM
W0503 02:31:14.235000 880626 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 880961 closing signal SIGTERM
W0503 02:31:14.235000 880626 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 880962 closing signal SIGTERM
W0503 02:31:14.238000 880626 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 880963 closing signal SIGTERM
W0503 02:31:44.248000 880626 torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 880960 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
Traceback (most recent call last):                                                                                                                          
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)                                                                                                  
torch.distributed.DistStoreError: wait timeout after 60000ms, keys: /torch.rendezvous.4544bd9d-76b0-4ad0-9489-ea243ae5e19a 

The above exception was the direct cause of the following exception:                                                                       

Traceback (most recent call last):                                   
  File "/home/users/vs-venv/bin/tune", line 8, in <module>                                                                        
    sys.exit(main())                                                 
  File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
    parser.run(args)                                                 
  File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
    args.func(args)                                                  
  File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)                                                                                     
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)                                        
  File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
    run(args)                                                        
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(                                                  
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))                                                                        
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()                                             
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)                                      
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
    result = self._invoke_run(role)                                  
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()                                                                                   
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1255, in num_nodes_waiting
    self._state_holder.sync()
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 437, in sync
    get_response = self._backend.get_state()
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 75, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 119, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
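
For what it's worth, my reading of the trace: ranks 1-3 reach the torch.distributed.barrier() inside save_checkpoint and then give up after the default 30-minute (1800000 ms) timeout, presumably while rank 0 is still assembling and writing the 14B checkpoint. A condensed sketch of the pattern as I understand it (not the actual recipe code; the function name and the save step are simplified assumptions on my part):

import torch
import torch.distributed as dist

def save_checkpoint_sketch(full_state_dict, path):
    # Simplified pattern: rank 0 serializes the gathered state dict to disk
    # while every other rank waits at a collective barrier.
    if dist.get_rank() == 0:
        torch.save(full_state_dict, path)  # slow for a ~14B model
    # If rank 0 takes longer than the process-group timeout (30 minutes by
    # default), the waiting ranks fail with "Timed out waiting 1800000ms ...".
    dist.barrier()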

My setup:

  • python 3.10
  • torchtune==0.6.1, torch==2.6.0
  • 1 node with 4 NVIDIA A100 80GB GPUs (all 4 in use)

I also tried using only 2 GPUs, but that did not help; I got a similar error trace. Googling the error message hasn't turned up much either.

Interestingly, I tried using only 100 samples from the dataset, keeping everything else the same (i.e., the only difference is the length of the dataset used for fine-tuning), and that experiment ran to completion without any problems.
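
Since the 100-sample run finishes fine, the main difference seems to be how long the end-of-epoch work takes before the ranks re-synchronize, so one workaround I'm considering is raising the process-group timeout above the 30-minute default. I haven't dug into how the recipe sets up its process groups, so this is only a minimal standalone sketch of the PyTorch API I have in mind (single-process, gloo), not a torchtune patch:

import datetime
import os
import torch.distributed as dist

# Minimal standalone sketch: init_process_group accepts a timeout that also
# bounds collectives such as barrier(). The rank/world_size/env values below
# just make the snippet runnable in one process; a real run would get them
# from torchrun.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",  # the failing barrier in the trace goes through gloo
    rank=0,
    world_size=1,
    timeout=datetime.timedelta(hours=2),  # default is 30 minutes
)
dist.barrier()
dist.destroy_process_group()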

Can anyone help with this issue? Please let me know if I've missed anything that might point me toward a solution.

Thanks, Vincent

vsoesanto commented on May 05 '25

Thanks for the issue! This looks like an interesting problem. I will try to reproduce.

krammnic commented on May 06 '25

Please update here with any findings; it would be much appreciated! I haven't found anything helpful on the web so far :/

vsoesanto commented on May 15 '25

Sorry for the significant delay here, will try to look today

krammnic commented on May 28 '25

Any update?

Opdoop commented on Aug 06 '25