[Bug] Deploying DeepSeek-R1-AWQ produces jumbled or nonsensical answers
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
I'm using sglang to deploy the DeepSeek-R1-AWQ model (cognitivecomputations/DeepSeek-R1-AWQ). After the service starts up, it returns jumbled or nonsensical answers to my questions. How should I handle this problem?
Reproduction
docker run -d --gpus all \
--shm-size 32g \
--name sgl-r1 \
-p 30000:30000 \
-v /data/model:/models \
--ipc=host \
lmsysorg/sglang:v0.4.3-cu124-srt \
python3 -m sglang.launch_server \
--model /models/DeepSeek-R1-awq \
--tp 8 \
--enable-p2p-check \
--trust-remote-code \
--port 30000 \
--dtype float16 \
--host 0.0.0.0 \
--mem-fraction-static 0.9 \
--served-model-name deepseek-r1-awq \
--max-running-requests 32
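For reference, the garbled output can be triggered with a minimal client request like the sketch below (this assumes the OpenAI-compatible /v1/chat/completions endpoint that sglang exposes on port 30000 and the served model name from the launch command above; the prompt is just a placeholder):

import requests

# Minimal request against the OpenAI-compatible endpoint exposed by the server above.
# Adjust the host/port if the container is mapped differently.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-r1-awq",
        "messages": [{"role": "user", "content": "Briefly explain what AWQ quantization is."}],
        "max_tokens": 256,
        "temperature": 0.6,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])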
Environment
1 node with 8× A800 GPUs
I'm running inference on two L40S machines and hit the same error. I also have to disable MLA, otherwise it exceeds the GPUs' physical shared memory.
It works fine with vLLM; just pick the quantization settings described in the model card.
Bro, have you found a solution yet? I'm also hitting this gibberish output.
Roughly what generation speed do you get with this deployment on short prompts?
I can confirm that this did occur.
I'm deploying with the following command:
uv run \
--python 3.12 \
--with sglang[all] \
--with transformers==4.48.3 \
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--trust-remote-code \
--tp 8 \
--dtype half \
--mem-fraction-static 0.75 \
--cuda-graph-max-bs 32 \
--context-length 153872 \
--host 0.0.0.0 \
--port 30000 \
--served-model-name deepseek-r1-awq
(P.S. mem-fraction-static must be set below 0.8 and cuda-graph-max-bs must be set to 32 to build the CUDA graph successfully, which I think is worth mentioning in the docs.)
Output starts out normal, then turns into gibberish.
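(For what it's worth, I watch the output degrade through a streaming request roughly like the sketch below; the endpoint address and prompt are placeholders rather than my exact test.)

import json
import requests

# Stream the completion and print tokens as they arrive, to see where the
# output turns into gibberish. Host/port/model name match the launch above.
with requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-r1-awq",
        "messages": [{"role": "user", "content": "Summarize the idea behind AWQ quantization."}],
        "max_tokens": 512,
        "stream": True,
    },
    stream=True,
    timeout=600,
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)
print()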
Besides, the TPS is only about 11, much lower than expected.
I assumed this might be due to the context-length option, so I lowered it to 16384, but the problem persists.
I tried other deployment methods and was able to get normal results, with TPS topping out at around 35, which suggests the model itself is fine.
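For context, a rough way to estimate TPS against the same endpoint is a single non-streaming request like the sketch below (the prompt and address are placeholders; it relies on the usage field returned in the OpenAI-compatible response):

import time
import requests

# Rough decode-throughput estimate: one request, then completion_tokens / wall-clock time.
start = time.time()
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-r1-awq",
        "messages": [{"role": "user", "content": "Write a short paragraph about quantized inference."}],
        "max_tokens": 512,
        "temperature": 0.6,
    },
    timeout=600,
)
elapsed = time.time() - start
completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")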
@hnyls2002 Thank you for your hard work! Would you kindly provide an update on the progress regarding this bug?
May I ask what deployment tool you used to reach a TPS of around 35?
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.