[Bug]: No response on a single GPU (model loading fails when tensor_parallel_size=1)
Your current environment
Problem
🐛 Describe the bug
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import torch
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained('/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct')
texts = []
# Prepare your prompts
# Define the batch of prompts
prompts = [
"宪法规定的公民法律义务有",
"属于专门人民法院的是",
"无效婚姻的种类包括",
"刑事案件定义",
"税收法律制度",
]
for prompt in prompts:
    messages = [
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)
sampling_params = SamplingParams(temperature=0.1, top_p=0.5, max_tokens=4096)
path = '/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct'
llm = LLM(model=path, trust_remote_code=True, tokenizer_mode="auto", tensor_parallel_size=2, dtype=torch.float16)
outputs = llm.generate(texts, sampling_params)
# Print the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The code above runs fine. But when I change tensor_parallel_size from 2 to 1, hoping to run offline inference on a single GPU, execution reaches this step:
llm = LLM(model=path, trust_remote_code=True, tokenizer_mode="auto", tensor_parallel_size=1, dtype=torch.float16)
and only prints the output below, then hangs indefinitely without reporting any error:
$sudo CUDA_VISIBLE_DEVICES=0 PYTHONPATH="/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/" python vllm_test.py
WARNING 08-05 11:02:58 config.py:1425] Casting torch.bfloat16 to torch.float16.
INFO 08-05 11:02:58 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct', speculative_config=None, tokenizer='/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-05 11:02:59 selector.py:151] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-05 11:02:59 selector.py:54] Using XFormers backend.
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:36893 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [11-88-234-70.gpu-exporter.prometheus.svc.cluster.local]:36893 (errno: 97 - Address family not supported by protocol).
The lines above are the last thing printed; after that it never raises an error and just stays like this. How can I fix it? Please help.
Eventually it reported a timeout error. How can I solve this? It's really urgent. Running the same command as above produces the same output, followed by:
[E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (11.88.234.70, 36893).
Traceback (most recent call last):
  File "vllm_test.py", line 32, in <module>
    llm = LLM(model=path, trust_remote_code=True, tensor_parallel_size=1, dtype=torch.float16)
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 155, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 441, in from_engine_args
    engine = cls(
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
    self.model_executor = executor_class(
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/executor/gpu_executor.py", line 35, in _init_executor
    self.driver_worker.init_device()
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/worker/worker.py", line 132, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/worker/worker.py", line 343, in init_worker_distributed_environment
    init_distributed_environment(parallel_config.world_size, rank,
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/vllm/distributed/parallel_state.py", line 812, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1305, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 199, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout, use_libuv)
  File "/GlobalData/rijian.lrj/miniconda3/envs/vllm_shc/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The client socket has timed out after 600s while trying to connect to (11.88.234.70, 36893).
This looks like an address-family problem. The server socket was started on IPv6 and failed (there is no log showing it was started on IPv4):
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:36893 (errno: 97 - Address family not supported by protocol).
But the client is connecting to the port over IPv4:
[E socket.cpp:957] [c10d] The client socket has timed out after 600s while trying to connect to (11.88.234.70, 36893).
You can try changing the IP used for this socket by setting the environment variable VLLM_HOST_IP.
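For reference, a minimal sketch of one way to do that (this is only an illustration, not the confirmed fix for this issue; 127.0.0.1 is a placeholder, use an IPv4 address that is actually reachable on your node, and set the variable before the engine is constructed so vLLM picks it up when it initializes the process group):

import os
# Placeholder address: replace with an IPv4 address reachable on this node.
os.environ["VLLM_HOST_IP"] = "127.0.0.1"

import torch
from vllm import LLM

path = '/ProjectRoot/long_content_LLM/qwen/Qwen2-1___5B-Instruct'
llm = LLM(model=path, trust_remote_code=True, tokenizer_mode="auto",
          tensor_parallel_size=1, dtype=torch.float16)

The same variable can also be set in the shell before launching the script instead of in Python.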
@DC-Shi hello, I changed VLLM_HOST_IP to, for example, 0.0.0.0, but it still fails. May I ask how you change VLLM_HOST_IP?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Hi, have you solved the problem?