[Performance]: When deploying DeepSeek-V3 on 8*H20 (96GB), the maximum model length only reaches 6500 with vLLM, but sglang can achieve 163840.
Proposal to improve performance
No response
Report of performance regression
No response
Misc discussion on performance
vLLM command:
python3 -m vllm.entrypoints.openai.api_server --model ${model_path} --port 8108 --max-model-len 6500 --gpu-memory-utilization 0.98 --tensor-parallel-size 8 --trust-remote-code --quantization fp8
sglang command:
python3 -m sglang.launch_server --model-path ${model_path} --tp 8 --trust-remote-code --mem-fraction-static 0.9 --host 0.0.0.0 --port 50050 --served-model-name atom
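One plausible explanation for the gap is KV-cache size per token rather than a tuning flag: DeepSeek-V3 uses Multi-head Latent Attention (MLA), and an engine that caches only the compressed latent needs far less memory per token than one that materializes the full per-head keys and values. Below is a minimal back-of-the-envelope sketch; the dimensions are assumptions taken from the model's config.json (61 layers, 128 heads, qk_nope_head_dim=128, qk_rope_head_dim=64, v_head_dim=128, kv_lora_rank=512) and are illustrative, not measured:

```python
# Rough KV-cache-per-token estimate for DeepSeek-V3.
# All dimensions below are assumed from the model's config.json and
# are illustrative only.
layers = 61
heads = 128
qk_nope, qk_rope, v_dim = 128, 64, 128
kv_lora_rank = 512
bytes_per_elem = 2  # fp16/bf16 cache entries

# Materializing full per-head K and V (no MLA-aware cache):
mha_per_token = layers * heads * ((qk_nope + qk_rope) + v_dim) * bytes_per_elem

# Caching only the compressed latent plus the shared RoPE key slice (MLA):
mla_per_token = layers * (kv_lora_rank + qk_rope) * bytes_per_elem

print(f"full K/V cache  : {mha_per_token / 2**20:.2f} MiB per token")
print(f"MLA latent cache: {mla_per_token / 2**20:.3f} MiB per token")
print(f"ratio           : {mha_per_token / mla_per_token:.0f}x")
```

On this rough estimate the full K/V cache costs roughly 70x more memory per token than the MLA latent cache, which points in the same direction as the observed 163840 / 6500 ≈ 25x context-length gap (weight memory and allocator overheads would account for the difference). If vLLM was not yet using an MLA-aware cache for this model at the time of the report, that alone could explain the ceiling on --max-model-len.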
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.