[Bug]: Client socket times out while trying to connect to GPU node when initializing DeepSeek-R1 with Ray + vLLM serving
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
I construct a Ray cluster and try to deploy several DeepSeek-R1 replicas with pipeline-parallel-size=3. Model initialization often fails (not every time; it can succeed after several retries). I'm on ray[serve]==2.44.0 and vllm==0.8.2. This issue has existed since 0.7.0: I've verified that 0.6.6.post1 works well, but every version after that can trigger the error message below when initializing multi-node models (in our case, R1).
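For reference, the deployment is roughly shaped like this (a minimal sketch; the class and deployment names are illustrative, not our exact code, but the engine args match the log below):

```python
# Minimal sketch of the Ray Serve + vLLM setup (illustrative names).
from ray import serve
from vllm import AsyncEngineArgs, AsyncLLMEngine

@serve.deployment(name="vllmDeployment")
class VLLMDeployment:
    def __init__(self):
        engine_args = AsyncEngineArgs(
            model="DeepSeek-R1",
            trust_remote_code=True,
            max_model_len=16384,
            distributed_executor_backend="ray",  # workers span multiple GPU nodes
            pipeline_parallel_size=3,
            tensor_parallel_size=8,
        )
        # Engine construction is where the multi-node rendezvous happens,
        # and where the timeout below is raised.
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

app = VLLMDeployment.bind()
```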
Error:
:job_id:02000000
:actor_name:ServeReplica:DS-R1:vllmDeployment
INFO 2025-03-29 06:17:53,742 DS-R1_vllmDeployment a3e34qi7 -- Starting with engine args: AsyncEngineArgs(model='DeepSeek-R1', served_model_name=None, tokenizer='DeepSeek-R1', hf_config_path=None, task='auto', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', seed=None, max_model_len=16384, distributed_executor_backend='ray', pipeline_parallel_size=3, tensor_parallel_size=8, enable_expert_parallel=False, max_parallel_loading_workers=None, block_size=None, enable_prefix_caching=True, disable_sliding_window=False, disable_cascade_attn=False, use_v2_block_manager=True, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, speculative_config=None, speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, worker_cls='auto', worker_extension_cls='', kv_transfer_config=None, generation_config='auto', override_generation_config=None, enable_sleep_mode=False, model_impl='auto', calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, use_tqdm_on_load=True, disable_log_requests=False)
[E329 06:28:12.542939754 socket.cpp:1023] [c10d] The client socket has timed out after 600000ms while trying to connect to (10.xxx.xxx.33, 54485).
[W329 06:28:12.581551416 TCPStore.cpp:330] [c10d] TCP client failed to connect/validate to host 10.xxx.xxx.33:54485 - retrying (try=0, timeout=600000ms, delay=84662ms): The client socket has timed out after 600000ms while trying to connect to (10.xxx.xxx.33, 54485).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1025 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74f4a7f6c1b6 in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x16144fe (0x74c4ec5584fe in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x63501ce (0x74c4f12941ce in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x6350386 (0x74c4f1294386 in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x63507f4 (0x74c4f12947f4 in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x630d216 (0x74c4f1251216 in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::TCPStore(std::string, c10d::TCPStoreOptions const&) + 0x20c (0x74c4f125414c in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe402df (0x74f44b8cf2df in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x518d37 (0x74f44afa7d37 in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x4fdcf7]
frame #10: _PyObject_MakeTpCall + 0x25b (0x4f747b in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #11: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x509d6f]
frame #12: PyVectorcall_Call + 0xb9 (0x50a909 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #13: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x507a4c]
frame #14: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x4f77e6]
frame #15: <unknown function> + 0x51752b (0x74f44afa652b in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: _PyObject_MakeTpCall + 0x25b (0x4f747b in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #17: _PyEval_EvalFrameDefault + 0x56d2 (0x4f3802 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #18: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #19: _PyEval_EvalFrameDefault + 0x31f (0x4ee44f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #20: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x572387]
frame #21: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x4fe324]
frame #22: _PyEval_EvalFrameDefault + 0x31f (0x4ee44f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #23: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #24: PyObject_Call + 0xb8 (0x50a5a8 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #25: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #26: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #27: PyObject_Call + 0xb8 (0x50a5a8 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #28: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #29: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #30: _PyEval_EvalFrameDefault + 0x13b3 (0x4ef4e3 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #31: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #32: _PyEval_EvalFrameDefault + 0x31f (0x4ee44f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #33: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #34: _PyEval_EvalFrameDefault + 0x31f (0x4ee44f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #35: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #36: _PyEval_EvalFrameDefault + 0x731 (0x4ee861 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #37: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x509d07]
frame #38: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #39: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #40: _PyEval_EvalFrameDefault + 0x31f (0x4ee44f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #41: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x509bd6]
frame #42: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #43: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #44: _PyEval_EvalFrameDefault + 0x731 (0x4ee861 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #45: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #46: _PyEval_EvalFrameDefault + 0x731 (0x4ee861 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #47: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #48: _PyEval_EvalFrameDefault + 0x731 (0x4ee861 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #49: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x509a7e]
frame #50: PyObject_Call + 0xb8 (0x50a5a8 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #51: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #52: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #53: _PyObject_FastCallDictTstate + 0x17d (0x4f687d in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #54: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x5075b8]
frame #55: _PyObject_MakeTpCall + 0x2ab (0x4f74cb in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #56: _PyEval_EvalFrameDefault + 0x56d2 (0x4f3802 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #57: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x509a7e]
frame #58: PyObject_Call + 0xb8 (0x50a5a8 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #59: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #60: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #61: _PyObject_FastCallDictTstate + 0x17d (0x4f687d in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #62: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x5075b8]
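From the trace, the timeout is raised while a worker constructs the c10d TCPStore client that torch.distributed uses for rendezvous, i.e. roughly the equivalent of this sketch (host and port are the redacted values from the log above; the snippet is illustrative, not vLLM's actual call site):

```python
# Sketch of the failing step: a worker connecting to the rendezvous TCPStore.
from datetime import timedelta
import torch.distributed as dist

# The master (rank 0) creates the store; every other worker connects as a
# client. If a client cannot reach host:port within the timeout, it raises
# the "client socket has timed out" error seen above.
store = dist.TCPStore(
    host_name="10.xxx.xxx.33",              # master address from the log (redacted)
    port=54485,                             # port chosen by the master
    is_master=False,                        # this side is a connecting client
    timeout=timedelta(milliseconds=600000), # 600000ms, matching the log
)
```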
This could be a similar issue to https://github.com/vllm-project/vllm/issues/13052.
When the model loads successfully, the log looks like this:
:job_id:02000000
:actor_name:ServeReplica:DS-R1:vllmDeployment
INFO 2025-03-29 06:17:53,742 DS-R1_vllmDeployment a3e34qi7 -- Starting with engine args: AsyncEngineArgs(model='DeepSeek-R1', served_model_name=None, tokenizer='DeepSeek-R1', hf_config_path=None, task='auto', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', seed=None, max_model_len=16384, distributed_executor_backend='ray', pipeline_parallel_size=3, tensor_parallel_size=8, enable_expert_parallel=False, max_parallel_loading_workers=None, block_size=None, enable_prefix_caching=True, disable_sliding_window=False, disable_cascade_attn=False, use_v2_block_manager=True, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, speculative_config=None, speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, worker_cls='auto', worker_extension_cls='', kv_transfer_config=None, generation_config='auto', override_generation_config=None, enable_sleep_mode=False, model_impl='auto', calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, use_tqdm_on_load=True, disable_log_requests=False)
Loading safetensors checkpoint shards: 0% Completed | 0/163 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 1% Completed | 1/163 [00:00<01:28, 1.83it/s]
Loading safetensors checkpoint shards: 1% Completed | 2/163 [00:16<25:20, 9.45s/it]
Loading safetensors checkpoint shards: 2% Completed | 3/163 [00:16<14:02, 5.27s/it]
Loading safetensors checkpoint shards: 2% Completed | 4/163 [00:29<21:32, 8.13s/it]
Loading safetensors checkpoint shards: 3% Completed | 5/163 [00:34<18:39, 7.09s/it]
Loading safetensors checkpoint shards: 4% Completed | 6/163 [00:34<12:35, 4.81s/it]
Loading safetensors checkpoint shards: 4% Completed | 7/163 [00:34<08:37, 3.32s/it]
Same error with Llama running on two T4 nodes.
Same error while creating two vllm.LLM instances on more than 2 GPUs:

```python
import vllm

# Model paths and engine settings come from our own config objects.
vllm_model = vllm.LLM(
    model=model_inference_settings.model_settings.model_path.absolute().as_posix(),
    **engine_settings.dict(),
)
sft_model = vllm.LLM(
    model=model_inference_settings.sft_settings.model_path.absolute().as_posix(),
    **sft_engine_settings.dict(),
)
```
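One guess (unverified) is that the two engines race to bind the same distributed init port. If that's the cause, pinning a distinct port per engine via the VLLM_PORT environment variable before constructing each LLM might help; a sketch with placeholder model paths:

```python
import os
import vllm

# Hypothetical workaround (untested assumption): give each engine its own
# rendezvous port so the two instances cannot collide on the same TCPStore.
os.environ["VLLM_PORT"] = "29601"
vllm_model = vllm.LLM(model="/path/to/base-model")

os.environ["VLLM_PORT"] = "29602"
sft_model = vllm.LLM(model="/path/to/sft-model")
```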
I am also getting the same TCP timeout error. vllm: 0.8.2, ray: 2.43.0, setup with 2 H100 nodes (2 GPUs).
Tried the flags NCCL_P2P_DISABLE=1 and NCCL_NVLS_ENABLE=0, and the --disable-custom-all-reduce option; none of them worked.
I resolved the issue by running the following command:

```
export NCCL_SOCKET_IFNAME=<network_interface>
```

You can find the network interface by running ifconfig and choosing the one with an inet address like 192.168.1.xx.
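If you need to set this from Python before the engine starts (e.g. inside a Serve deployment), something like the following sketch should be equivalent. Adding GLOO_SOCKET_IFNAME is my own assumption, since the timeout above is raised by the c10d store rather than NCCL itself:

```python
import os

# Pin the network interface used for distributed rendezvous. "eth0" is a
# placeholder; pick the interface whose inet address matches your node IP.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
# Assumption: the c10d/gloo side may also need pinning, not just NCCL.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"
```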
Tried setting NCCL_SOCKET_IFNAME, but it didn't work. Running in Azure k8s.
Same problem; NCCL_SOCKET_IFNAME does not work.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!