[Bug] Deploying DeepSeek-R1-AWQ produces jumbled or nonsensical answers
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [ ] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
- [ ] 5. Please use English, otherwise it will be closed.
Describe the bug
I'm using sglang to deploy the DeepSeek-R1-AWQ model (cognitivecomputations/DeepSeek-R1-AWQ). After the service starts up, it returns jumbled or nonsensical answers to my questions. How should I handle this problem?
Reproduction
docker run -d --gpus all \
--shm-size 32g \
--name sgl-r1 \
-p 30000:30000 \
-v /data/model:/models \
--ipc=host \
lmsysorg/sglang:v0.4.3-cu124-srt \
python3 -m sglang.launch_server \
--model /models/DeepSeek-R1-awq \
--tp 8 \
--enable-p2p-check \
--trust-remote-code \
--port 30000 \
--dtype float16 \
--host 0.0.0.0 \
--mem-fraction-static 0.9 \
--served-model-name deepseek-r1-awq \
--max-running-requests 32
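For reference, the garbled output can be triggered with a minimal client request like the sketch below (this assumes the OpenAI-compatible /v1/chat/completions endpoint that sglang exposes on port 30000 and the served model name from the launch command above; the prompt is just a placeholder):

import requests

# Minimal request against the OpenAI-compatible endpoint exposed by the server above.
# Adjust the host/port if the container is mapped differently.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-r1-awq",
        "messages": [{"role": "user", "content": "Briefly explain what AWQ quantization is."}],
        "max_tokens": 256,
        "temperature": 0.6,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])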
Environment
1 node with 8× A800 GPUs
I'm running inference on two L40S machines and hit the same error. I also have to disable MLA, otherwise it exceeds the GPUs' physical shared memory.
It works fine with vLLM; just pick the quantization settings described in the model card.
Bro, have you found a solution yet? I'm also hitting this gibberish output.
Roughly what generation speed do you get with this deployment on short prompts?
I can confirm that this did occur.
I'm deploying with the following command:
uv run \
--python 3.12 \
--with sglang[all] \
--with transformers==4.48.3 \
python -m sglang.launch_server \
--model-path $MODEL_PATH \
--trust-remote-code \
--tp 8 \
--dtype half \
--mem-fraction-static 0.75 \
--cuda-graph-max-bs 32 \
--context-length 153872 \
--host 0.0.0.0 \
--port 30000 \
--served-model-name deepseek-r1-awq
(P.S. mem-fraction-static must be set below 0.8 and cuda-graph-max-bs must be set to 32 to build the CUDA graph successfully, which I think is worth mentioning in the docs.)
Output starts out normal, then turns into gibberish.
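(For what it's worth, I watch the output degrade through a streaming request roughly like the sketch below; the endpoint address and prompt are placeholders rather than my exact test.)

import json
import requests

# Stream the completion and print tokens as they arrive, to see where the
# output turns into gibberish. Host/port/model name match the launch above.
with requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-r1-awq",
        "messages": [{"role": "user", "content": "Summarize the idea behind AWQ quantization."}],
        "max_tokens": 512,
        "stream": True,
    },
    stream=True,
    timeout=600,
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"].get("content")
        if delta:
            print(delta, end="", flush=True)
print()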
Besides, the TPS is only about 11, much lower than expected.
I assumed this might be due to the context-length option, so I lowered it to 16384, but the problem persists.
I tried other deployment methods and was able to get normal results, with TPS topping out at around 35, which suggests the model itself is fine.
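For context, a rough way to estimate TPS against the same endpoint is a single non-streaming request like the sketch below (the prompt and address are placeholders; it relies on the usage field returned in the OpenAI-compatible response):

import time
import requests

# Rough decode-throughput estimate: one request, then completion_tokens / wall-clock time.
start = time.time()
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "deepseek-r1-awq",
        "messages": [{"role": "user", "content": "Write a short paragraph about quantized inference."}],
        "max_tokens": 512,
        "temperature": 0.6,
    },
    timeout=600,
)
elapsed = time.time() - start
completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.1f}s -> {completion_tokens / elapsed:.1f} tok/s")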
@hnyls2002 Thank you for your hard work! Would you kindly provide an update on the progress regarding this bug?
May I ask what deployment tool you used to reach a TPS of around 35?
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.