[Bug]: Client socket times out while trying to connect to GPU node when initializing DeepSeek-R1 with Ray + vLLM serving
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
I construct a Ray cluster and try to deploy several DeepSeek-R1 replicas with pipeline-parallel-size=3. Model initialization often fails (not every time; it can succeed after several retries). I'm on ray[serve]==2.44.0 and vllm==0.8.2. This issue has existed since 0.7.0: I've verified that 0.6.6.post1 works well, but every version after that can trigger the error message below when initializing multi-node models (in our case, R1).
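For reference, the deployment is roughly shaped like this (a minimal sketch; the class and deployment names are illustrative, not our exact code, but the engine args match the log below):

```python
# Minimal sketch of the Ray Serve + vLLM setup (illustrative names).
from ray import serve
from vllm import AsyncEngineArgs, AsyncLLMEngine

@serve.deployment(name="vllmDeployment")
class VLLMDeployment:
    def __init__(self):
        engine_args = AsyncEngineArgs(
            model="DeepSeek-R1",
            trust_remote_code=True,
            max_model_len=16384,
            distributed_executor_backend="ray",  # workers span multiple GPU nodes
            pipeline_parallel_size=3,
            tensor_parallel_size=8,
        )
        # Engine construction is where the multi-node rendezvous happens,
        # and where the timeout below is raised.
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)

app = VLLMDeployment.bind()
```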
Error:
:job_id:02000000
:actor_name:ServeReplica:DS-R1:vllmDeployment
INFO 2025-03-29 06:17:53,742 DS-R1_vllmDeployment a3e34qi7 -- Starting with engine args: AsyncEngineArgs(model='DeepSeek-R1', served_model_name=None, tokenizer='DeepSeek-R1', hf_config_path=None, task='auto', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', seed=None, max_model_len=16384, distributed_executor_backend='ray', pipeline_parallel_size=3, tensor_parallel_size=8, enable_expert_parallel=False, max_parallel_loading_workers=None, block_size=None, enable_prefix_caching=True, disable_sliding_window=False, disable_cascade_attn=False, use_v2_block_manager=True, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, speculative_config=None, speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, worker_cls='auto', worker_extension_cls='', kv_transfer_config=None, generation_config='auto', override_generation_config=None, enable_sleep_mode=False, model_impl='auto', calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, use_tqdm_on_load=True, disable_log_requests=False)
[E329 06:28:12.542939754 socket.cpp:1023] [c10d] The client socket has timed out after 600000ms while trying to connect to (10.xxx.xxx.33, 54485).
[W329 06:28:12.581551416 TCPStore.cpp:330] [c10d] TCP client failed to connect/validate to host 10.xxx.xxx.33:54485 - retrying (try=0, timeout=600000ms, delay=84662ms): The client socket has timed out after 600000ms while trying to connect to (10.xxx.xxx.33, 54485).
Exception raised from throwTimeoutError at /pytorch/torch/csrc/distributed/c10d/socket.cpp:1025 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74f4a7f6c1b6 in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x16144fe (0x74c4ec5584fe in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x63501ce (0x74c4f12941ce in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x6350386 (0x74c4f1294386 in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x63507f4 (0x74c4f12947f4 in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x630d216 (0x74c4f1251216 in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::TCPStore::TCPStore(std::string, c10d::TCPStoreOptions const&) + 0x20c (0x74c4f125414c in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0xe402df (0x74f44b8cf2df in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0x518d37 (0x74f44afa7d37 in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x4fdcf7]
frame #10: _PyObject_MakeTpCall + 0x25b (0x4f747b in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #11: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x509d6f]
frame #12: PyVectorcall_Call + 0xb9 (0x50a909 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #13: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x507a4c]
frame #14: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x4f77e6]
frame #15: <unknown function> + 0x51752b (0x74f44afa652b in /conda/envs/vllm/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #16: _PyObject_MakeTpCall + 0x25b (0x4f747b in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #17: _PyEval_EvalFrameDefault + 0x56d2 (0x4f3802 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #18: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #19: _PyEval_EvalFrameDefault + 0x31f (0x4ee44f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #20: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x572387]
frame #21: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x4fe324]
frame #22: _PyEval_EvalFrameDefault + 0x31f (0x4ee44f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #23: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #24: PyObject_Call + 0xb8 (0x50a5a8 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #25: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #26: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #27: PyObject_Call + 0xb8 (0x50a5a8 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #28: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #29: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #30: _PyEval_EvalFrameDefault + 0x13b3 (0x4ef4e3 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #31: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #32: _PyEval_EvalFrameDefault + 0x31f (0x4ee44f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #33: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #34: _PyEval_EvalFrameDefault + 0x31f (0x4ee44f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #35: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #36: _PyEval_EvalFrameDefault + 0x731 (0x4ee861 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #37: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x509d07]
frame #38: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #39: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #40: _PyEval_EvalFrameDefault + 0x31f (0x4ee44f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #41: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x509bd6]
frame #42: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #43: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #44: _PyEval_EvalFrameDefault + 0x731 (0x4ee861 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #45: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #46: _PyEval_EvalFrameDefault + 0x731 (0x4ee861 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #47: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #48: _PyEval_EvalFrameDefault + 0x731 (0x4ee861 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #49: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x509a7e]
frame #50: PyObject_Call + 0xb8 (0x50a5a8 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #51: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #52: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #53: _PyObject_FastCallDictTstate + 0x17d (0x4f687d in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #54: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x5075b8]
frame #55: _PyObject_MakeTpCall + 0x2ab (0x4f74cb in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #56: _PyEval_EvalFrameDefault + 0x56d2 (0x4f3802 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #57: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x509a7e]
frame #58: PyObject_Call + 0xb8 (0x50a5a8 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #59: _PyEval_EvalFrameDefault + 0x2b79 (0x4f0ca9 in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #60: _PyFunction_Vectorcall + 0x6f (0x4fe13f in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #61: _PyObject_FastCallDictTstate + 0x17d (0x4f687d in ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata)
frame #62: ray::ServeReplica:DS-R1:vllmDeployment.initialize_and_get_metadata() [0x5075b8]
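From the trace, the timeout is raised while a worker constructs the c10d TCPStore client that torch.distributed uses for rendezvous, i.e. roughly the equivalent of this sketch (host and port are the redacted values from the log above; the snippet is illustrative, not vLLM's actual call site):

```python
# Sketch of the failing step: a worker connecting to the rendezvous TCPStore.
from datetime import timedelta
import torch.distributed as dist

# The master (rank 0) creates the store; every other worker connects as a
# client. If a client cannot reach host:port within the timeout, it raises
# the "client socket has timed out" error seen above.
store = dist.TCPStore(
    host_name="10.xxx.xxx.33",              # master address from the log (redacted)
    port=54485,                             # port chosen by the master
    is_master=False,                        # this side is a connecting client
    timeout=timedelta(milliseconds=600000), # 600000ms, matching the log
)
```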
This could be a similar issue to https://github.com/vllm-project/vllm/issues/13052.
When the model loads successfully, the log looks like this:
:job_id:02000000
:actor_name:ServeReplica:DS-R1:vllmDeployment
INFO 2025-03-29 06:17:53,742 DS-R1_vllmDeployment a3e34qi7 -- Starting with engine args: AsyncEngineArgs(model='DeepSeek-R1', served_model_name=None, tokenizer='DeepSeek-R1', hf_config_path=None, task='auto', skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', seed=None, max_model_len=16384, distributed_executor_backend='ray', pipeline_parallel_size=3, tensor_parallel_size=8, enable_expert_parallel=False, max_parallel_loading_workers=None, block_size=None, enable_prefix_caching=True, disable_sliding_window=False, disable_cascade_attn=False, use_v2_block_manager=True, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, revision=None, code_revision=None, rope_scaling=None, rope_theta=None, hf_overrides=None, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, fully_sharded_loras=False, lora_extra_vocab_size=256, long_lora_scaling_factors=None, lora_dtype='auto', max_cpu_loras=None, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, guided_decoding_backend='xgrammar', logits_processor_pattern=None, speculative_config=None, speculative_model=None, speculative_model_quantization=None, speculative_draft_tensor_parallel_size=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, worker_cls='auto', worker_extension_cls='', kv_transfer_config=None, generation_config='auto', override_generation_config=None, enable_sleep_mode=False, model_impl='auto', calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, use_tqdm_on_load=True, disable_log_requests=False)
Loading safetensors checkpoint shards: 0% Completed | 0/163 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 1% Completed | 1/163 [00:00<01:28, 1.83it/s]
Loading safetensors checkpoint shards: 1% Completed | 2/163 [00:16<25:20, 9.45s/it]
Loading safetensors checkpoint shards: 2% Completed | 3/163 [00:16<14:02, 5.27s/it]
Loading safetensors checkpoint shards: 2% Completed | 4/163 [00:29<21:32, 8.13s/it]
Loading safetensors checkpoint shards: 3% Completed | 5/163 [00:34<18:39, 7.09s/it]
Loading safetensors checkpoint shards: 4% Completed | 6/163 [00:34<12:35, 4.81s/it]
Loading safetensors checkpoint shards: 4% Completed | 7/163 [00:34<08:37, 3.32s/it]
Same error with Llama running on two T4 nodes.
Same error while creating two vllm.LLM instances on more than 2 GPUs:

```python
import vllm

# Model paths and engine settings come from our own config objects.
vllm_model = vllm.LLM(
    model=model_inference_settings.model_settings.model_path.absolute().as_posix(),
    **engine_settings.dict(),
)
sft_model = vllm.LLM(
    model=model_inference_settings.sft_settings.model_path.absolute().as_posix(),
    **sft_engine_settings.dict(),
)
```
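One guess (unverified) is that the two engines race to bind the same distributed init port. If that's the cause, pinning a distinct port per engine via the VLLM_PORT environment variable before constructing each LLM might help; a sketch with placeholder model paths:

```python
import os
import vllm

# Hypothetical workaround (untested assumption): give each engine its own
# rendezvous port so the two instances cannot collide on the same TCPStore.
os.environ["VLLM_PORT"] = "29601"
vllm_model = vllm.LLM(model="/path/to/base-model")

os.environ["VLLM_PORT"] = "29602"
sft_model = vllm.LLM(model="/path/to/sft-model")
```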
I am also getting the same TCP timeout error. vllm: 0.8.2, ray: 2.43.0, setup with 2 H100 nodes (2 GPUs).
Tried the flags NCCL_P2P_DISABLE=1 and NCCL_NVLS_ENABLE=0, and the --disable-custom-all-reduce option; none of them worked.
I resolved the issue by running the following command:

```
export NCCL_SOCKET_IFNAME=<network_interface>
```

You can find the network interface by running ifconfig and choosing the one with an inet address like 192.168.1.xx.
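If you need to set this from Python before the engine starts (e.g. inside a Serve deployment), something like the following sketch should be equivalent. Adding GLOO_SOCKET_IFNAME is my own assumption, since the timeout above is raised by the c10d store rather than NCCL itself:

```python
import os

# Pin the network interface used for distributed rendezvous. "eth0" is a
# placeholder; pick the interface whose inet address matches your node IP.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
# Assumption: the c10d/gloo side may also need pinning, not just NCCL.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"
```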
Tried setting NCCL_SOCKET_IFNAME, but it didn't work. Running in Azure k8s.
Same problem; NCCL_SOCKET_IFNAME does not work.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!