[Bug]: When using Docker vllm/vllm-openai:v0.7.2 to deploy DeepSeek-R1 AWQ, I get empty content
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
Device: 8 × H100

```shell
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 23333 \
  --max-model-len 60000 --trust-remote-code --tensor-parallel-size 8 \
  --quantization moe_wna16 --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8_e5m2 --calculate-kv-scales \
  --served-model-name deepseek-reasoner \
  --model cognitivecomputations/DeepSeek-R1-AWQ
```
Request ("你是谁" means "Who are you?"):

```shell
curl http://localhost:23333/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "deepseek-reasoner", "messages": [{"role": "user", "content": "你是谁"}], "stream": true, "temperature": 1.2}'
```

Streamed response — every chunk carries an empty `content`:

```
data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}
```

(The remaining chunks are identical, all with empty `content`.)
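To confirm programmatically that the stream never produces any text, the SSE chunks above can be parsed and their deltas concatenated. A minimal sketch in Python — the sample lines are copied from the captured response above, and no live server is assumed:

```python
import json

def collect_stream_content(sse_lines):
    """Concatenate the delta `content` fields from OpenAI-style SSE chunks."""
    text = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        text.append(delta.get("content") or "")
    return "".join(text)

# Chunks captured from the failing deployment: every delta content is "".
sample = [
    'data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}',
    'data: {"id":"chatcmpl-c7e88282efa547cfba27b429df7df593","object":"chat.completion.chunk","created":1739440234,"model":"deepseek-reasoner","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":null}]}',
]

print(repr(collect_stream_content(sample)))  # → ''
```

An empty result here means the model emitted only empty deltas, i.e. the bug is in generation, not in client-side parsing.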
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Hello, I hit the same problem with 2 nodes of 4 × A100 80G. Have you found a solution yet?
Same here on 8 × A800 80G with vllm/vllm-openai:latest.
Success on 8 × A800 80G:

```shell
VLLM_WORKER_MULTIPROC_METHOD=spawn vllm serve /cognitivecomputations/DeepSeek-R1-AWQ \
  --host 0.0.0.0 --port 12345 --max-model-len 16384 --max-num-batched-tokens 16384 \
  --trust-remote-code --tensor-parallel-size 8 --gpu-memory-utilization 0.95 \
  --dtype float16 --enable-reasoning --reasoning-parser deepseek_r1 \
  --served-model-name deepseek-reasoner --enforce-eager
```
However, MLA is not supported with this configuration, and throughput is only about 5 tokens per second.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!