[Performance]: When deploying DeepSeek-V3 on 8*H20 (96GB), the maximum model length only reaches 6500 with vLLM, but sglang can achieve 163840.
Proposal to improve performance
No response
Report of performance regression
No response
Misc discussion on performance
vLLM command:
python3 -m vllm.entrypoints.openai.api_server --model ${model_path} --port 8108 --max-model-len 6500 --gpu-memory-utilization 0.98 --tensor-parallel-size 8 --trust-remote-code --quantization fp8
sglang command:
python3 -m sglang.launch_server --model-path ${model_path} --tp 8 --trust-remote-code --mem-fraction-static 0.9 --host 0.0.0.0 --port 50050 --served-model-name atom
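One plausible explanation for the gap is KV-cache size per token rather than a tuning flag: DeepSeek-V3 uses Multi-head Latent Attention (MLA), and an engine that caches only the compressed latent needs far less memory per token than one that materializes the full per-head keys and values. Below is a minimal back-of-the-envelope sketch; the dimensions are assumptions taken from the model's config.json (61 layers, 128 heads, qk_nope_head_dim=128, qk_rope_head_dim=64, v_head_dim=128, kv_lora_rank=512) and are illustrative, not measured:

```python
# Rough KV-cache-per-token estimate for DeepSeek-V3.
# All dimensions below are assumed from the model's config.json and
# are illustrative only.
layers = 61
heads = 128
qk_nope, qk_rope, v_dim = 128, 64, 128
kv_lora_rank = 512
bytes_per_elem = 2  # fp16/bf16 cache entries

# Materializing full per-head K and V (no MLA-aware cache):
mha_per_token = layers * heads * ((qk_nope + qk_rope) + v_dim) * bytes_per_elem

# Caching only the compressed latent plus the shared RoPE key slice (MLA):
mla_per_token = layers * (kv_lora_rank + qk_rope) * bytes_per_elem

print(f"full K/V cache  : {mha_per_token / 2**20:.2f} MiB per token")
print(f"MLA latent cache: {mla_per_token / 2**20:.3f} MiB per token")
print(f"ratio           : {mha_per_token / mla_per_token:.0f}x")
```

On this rough estimate the full K/V cache costs roughly 70x more memory per token than the MLA latent cache, which points in the same direction as the observed 163840 / 6500 ≈ 25x context-length gap (weight memory and allocator overheads would account for the difference). If vLLM was not yet using an MLA-aware cache for this model at the time of the report, that alone could explain the ceiling on --max-model-len.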
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.