[Usage]: vllm inference with 2 * NVIDIA L20, output repeats ("!!!!")
Your current environment
vllm==0.6.1
How would you like to use vllm
Steps to reproduce: this happens with Qwen2.5-32B-Instruct-AWQ. The problem can be reproduced with the following steps.
Start the vllm service:

```bash
exec python3 -m vllm.entrypoints.openai.api_server \
    --served-model-name ${model_name} \
    --model ./${model_name} \
    --port 8890 \
    --quantization awq \
    --tensor-parallel-size 2 \
    --enable_auto_tool_choice \
    --tool-call-parser hermes 1>vllm.log 2>&1 &
```
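To confirm the server is up before sending chat requests, a quick check (a sketch, assuming the server is reachable on localhost:8890 and the `openai` Python client is installed) is to list the served models:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server does not validate the API key by default,
# so a placeholder string is enough here.
client = OpenAI(base_url="http://localhost:8890/v1", api_key="EMPTY")

# Should print the name passed via --served-model-name once the model has finished loading.
for model in client.models.list():
    print(model.id)
```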
Run the inference request:

```python
from openai import OpenAI

# Assumes the server started above; the model name matches --served-model-name.
model_name = "Qwen2.5-32B-Instruct-AWQ"
client = OpenAI(base_url="http://localhost:8890/v1", api_key="EMPTY")

# System prompt (Chinese): "You are a security expert; your task is to answer the
# user's input and return the result in markdown format."
system_prompt = "你是一个安全专家,你的任务是根据用户的输入,以markdown格式返回结果。"
# Query (Chinese): "What is an SSRF vulnerability?"
query = "什么是SSRF漏洞"

messages = [{"role": "system", "content": system_prompt}]
messages.append({"role": "user", "content": query})

completion = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.1,
    top_p=0.9,
    max_tokens=4096,
    tools=[],
    extra_body={
        "repetition_penalty": 1.05,
    },
)

req_id = completion.id
total_token = completion.usage.total_tokens
completion_token = completion.usage.completion_tokens
prompt_tokens = completion.usage.prompt_tokens
```
Output results: the response should be a normal markdown answer, but the actual output is a run of repeated tokens such as `!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!`.
With 2 * A30 GPUs I could not reproduce the issue, but on the L20s roughly 30% of requests reproduce it. Is the problem related to the GPU type or the CUDA driver? Looking for help.
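Since only some requests fail, a rough way to estimate the reproduction rate (a sketch, not part of the original report; it reuses `client`, `model_name`, and `messages` from above and relies on a hypothetical heuristic that treats a long run of `!` characters as the failure mode) is to repeat the same request and count degenerate outputs:

```python
def looks_degenerate(text: str, run_length: int = 20) -> bool:
    # Heuristic: 20+ consecutive "!" characters are treated as the repeated-output failure.
    return "!" * run_length in text

num_requests = 50
failures = 0
for _ in range(num_requests):
    completion = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=0.1,
        top_p=0.9,
        max_tokens=4096,
        extra_body={"repetition_penalty": 1.05},
    )
    if looks_degenerate(completion.choices[0].message.content or ""):
        failures += 1

print(f"degenerate outputs: {failures}/{num_requests}")
```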
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.