Running into a torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed.
Hi,
First off, I'd like to say thanks for torchtune: it gives me a nice level of abstraction for experimenting with various post-training strategies.
However, I'm running into an error. I'm trying to fine-tune a Phi-4 model with the lora_finetune_distributed recipe, using a custom dataset in the alpaca-cleaned format, on 4 GPUs (the config YAML recommends 2+ GPUs).
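For context, each record in train.json follows the usual alpaca-style schema with the three fields that the column_map in the config below points at. A representative (made-up) entry looks roughly like this:

```python
# Illustrative only: the shape of one record in the alpaca-cleaned JSON schema
# (instruction / input / output) that the dataset config below maps onto.
# The actual text in train.json is domain-specific and not shown here.
sample_record = {
    "instruction": "Summarize the following paragraph.",
    "input": "torchtune provides hackable recipes for post-training LLMs ...",
    "output": "torchtune offers configurable recipes for fine-tuning language models.",
}
```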
The command I ran is the following:
tune run --nproc_per_node 4 lora_finetune_distributed --config <custom_14B_lora_config.yaml>
Here is the config:
output_dir: /home/users/vincent/torchtune/phi-4/14B_lora_V15

# Model arguments
model:
  _component_: torchtune.models.phi4.lora_phi4_14b
  lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
  apply_lora_to_mlp: True
  apply_lora_to_output: False
  lora_rank: 8  # higher increases accuracy and memory
  lora_alpha: 16  # usually alpha=2*rank
  lora_dropout: 0.0

# Tokenizer
tokenizer:
  _component_: torchtune.models.phi4.phi4_tokenizer
  vocab_path: /home/users/vincent/base_models/phi-4/vocab.json
  merges_path: /home/users/vincent/base_models/phi-4/merges.txt
  max_seq_len: null

# Checkpointer
checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /home/users/vincent/base_models/phi-4
  checkpoint_files: [
    model-00001-of-00006.safetensors,
    model-00002-of-00006.safetensors,
    model-00003-of-00006.safetensors,
    model-00004-of-00006.safetensors,
    model-00005-of-00006.safetensors,
    model-00006-of-00006.safetensors,
  ]
  recipe_checkpoint: null
  output_dir: ${output_dir}
  model_type: PHI4
resume_from_checkpoint: False
save_adapter_weights_only: False

# Dataset
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  source: json
  data_files: //home/users/vincent/train.json
  column_map:
    instruction: instruction
    input: input
    output: output
  train_on_input: False
  packed: False
  split: train
seed: 42
shuffle: True

# Fine-tuning arguments
epochs: 15
max_steps_per_epoch: null
batch_size: 2
gradient_accumulation_steps: 8  # Use to increase effective batch size

optimizer:
  _component_: torch.optim.AdamW
  fused: True
  weight_decay: 0.01
  lr: 3e-4

lr_scheduler:
  _component_: torchtune.training.lr_schedulers.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torchtune.modules.loss.CEWithChunkedOutputLoss

compile: False  # torch.compile the model + loss, True increases speed + decreases memory

# Training env
device: cuda

# Memory management
enable_activation_checkpointing: True  # True reduces memory
enable_activation_offloading: True  # True reduces memory
dtype: bf16

# Logging
metric_logger:
  _component_: torchtune.training.metric_logging.DiskLogger
  log_dir: ${output_dir}/logs
log_every_n_steps: 1
log_peak_memory_stats: True

# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False

  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs

  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True

  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False

  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 5
  warmup_steps: 3
  active_steps: 2
  num_cycles: 1
The error occurs once an epoch completes and the checkpoint is being saved; the trace below is from right after the first epoch finished.
The full error trace is as follows:
...
INFO:torchtune.utils._logging:Saving checkpoint. This may take some time. Retrieving full model state dict...
INFO:torchtune.utils._logging:Getting full model state dict took 19.69 secs
INFO:torchtune.utils._logging:Retrieving optimizer state dict...
INFO:torchtune.utils._logging:Getting optimizer state dict took 20.07 secs
[rank3]: Traceback (most recent call last):
[rank3]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 933, in <module>
[rank3]: sys.exit(recipe_main())
[rank3]: File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank3]: sys.exit(recipe_main(conf))
[rank3]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 928, in recipe_main
[rank3]: recipe.train()
[rank3]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 894, in train
[rank3]: self.save_checkpoint(epoch=curr_epoch)
[rank3]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 745, in save_checkpoint
[rank3]: torch.distributed.barrier()
[rank3]: File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank3]: return func(*args, **kwargs)
[rank3]: File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4556, in barrier
[rank3]: work.wait()
[rank3]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 1800000ms for send operation to complete
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 933, in <module>
[rank1]: sys.exit(recipe_main())
[rank1]: File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank1]: sys.exit(recipe_main(conf))
[rank1]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 928, in recipe_main
[rank1]: recipe.train()
[rank1]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 894, in train
[rank1]: self.save_checkpoint(epoch=curr_epoch)
[rank1]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 745, in save_checkpoint
[rank1]: torch.distributed.barrier()
[rank1]: File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4556, in barrier
[rank1]: work.wait()
[rank1]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[rank2]: Traceback (most recent call last):
[rank2]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 933, in <module>
[rank2]: sys.exit(recipe_main())
[rank2]: File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank2]: sys.exit(recipe_main(conf))
[rank2]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 928, in recipe_main
[rank2]: recipe.train()
[rank2]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 894, in train
[rank2]: self.save_checkpoint(epoch=curr_epoch)
[rank2]: File "/home/users/vs-venv/lib/python3.10/site-packages/recipes/lora_finetune_distributed.py", line 745, in save_checkpoint
[rank2]: torch.distributed.barrier()
[rank2]: File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank2]: return func(*args, **kwargs)
[rank2]: File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4556, in barrier
[rank2]: work.wait()
[rank2]: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[W503 01:52:07.597517393 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W503 01:48:33.433175032 socket.cpp:464] [c10d] waitForInput: poll for socket SocketImpl(fd=54, addr=[::2a72:0:b80a:e7e0:5f5a:0]:60386, remote=[localhost]:38081) returned 0, likely a timeout
[W503 01:52:54.119670007 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W503 01:52:26.348969241 socket.cpp:489] [c10d] waitForInput: socket SocketImpl(fd=54, addr=[4015:e2eb:fd7f:0:b80a:e7e0:5f5a:0]:60386, remote=[localhost]:38081) timed out after 60000ms
W0503 02:31:14.231000 880626 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 880960 closing signal SIGTERM
W0503 02:31:14.235000 880626 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 880961 closing signal SIGTERM
W0503 02:31:14.235000 880626 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 880962 closing signal SIGTERM
W0503 02:31:14.238000 880626 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 880963 closing signal SIGTERM
W0503 02:31:44.248000 880626 torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 880960 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
Traceback (most recent call last):
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 117, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistStoreError: wait timeout after 60000ms, keys: /torch.rendezvous.4544bd9d-76b0-4ad0-9489-ea243ae5e19a
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/users/vs-venv/bin/tune", line 8, in <module>
sys.exit(main())
File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 52, in main
parser.run(args)
File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/_cli/tune.py", line 46, in run
args.func(args)
File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 212, in _run_cmd
self._run_distributed(args, is_builtin=is_builtin)
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355,
in wrapper
return f(*args, **kwargs)
File "/home/users/vs-venv/lib/python3.10/site-packages/torchtune/_cli/run.py", line 101, in _run_distributed
run(args)
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
result = agent.run()
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
result = f(*args, **kwargs)
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 711, in run
result = self._invoke_run(role)
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 906, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1255, i
n num_nodes_waiting
self._state_holder.sync()
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 437, in
sync
get_response = self._backend.get_state()
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 75
, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/home/users/vs-venv/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 11
9, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
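For what it's worth, the 1800000 ms in the rank 1-3 barrier errors matches PyTorch's default 30-minute process-group timeout, so it looks like the non-zero ranks sit in torch.distributed.barrier() while rank 0 writes the checkpoint and eventually give up, after which the rendezvous store tears down. A minimal standalone sketch of that mechanism (plain PyTorch, not torchtune's internal setup; the 2-hour timeout is purely illustrative) would be:

```python
# Standalone illustration of the timeout mechanism seen in the trace above.
# This is plain PyTorch, NOT how torchtune wires things up internally.
import datetime
import time

import torch.distributed as dist


def main() -> None:
    # PyTorch's default process-group timeout is timedelta(minutes=30),
    # i.e. exactly the 1800000 ms reported by gloo above. Passing a larger
    # value is one way to give rank 0 more headroom before the other ranks'
    # barrier gives up (2 hours here is just an example value).
    dist.init_process_group(backend="gloo", timeout=datetime.timedelta(hours=2))

    if dist.get_rank() == 0:
        time.sleep(5)  # stand-in for a slow checkpoint save on rank 0

    # The other ranks block here; if rank 0 exceeds the process-group timeout,
    # they raise the same "Timed out waiting ... for send/recv operation" error.
    dist.barrier()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # e.g. launched with: torchrun --nproc_per_node 4 barrier_timeout_sketch.py
```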
My setup:
- python 3.10
- torchtune==0.6.1, torch==2.6.0
- 1 node with 4 NVIDIA A100 80GB GPUs (all 4 in use)
I also tried using only 2 GPUs, but that did not help; I got a similar error trace. Googling the error message didn't help much either.
Interestingly, when I used only 100 samples from the dataset and kept everything else the same (i.e., the only difference was the size of the fine-tuning dataset), the run completed without any problems.
Can anyone help with this issue? Please let me know if there's anything that I missed here that might point me in the direction of a solution.
Thanks, Vincent
Thanks for the issue! This looks like an interesting problem. I will try to reproduce.
Please update here with any findings; it would be much appreciated! I haven't found anything helpful on the web so far :/
Sorry for the significant delay here, will try to look today
Any update?