[Performance]: VLLM becomes very slow when the number of requests is high
Your current environment
The output of `python collect_env.py`
How would you like to use vllm
I am serving a 72B quantized model on a single A100. This is the launch script:

```
python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --max-model-len 9000 --served-model-name chat-yzq --model /workspace/chat-v1-Int4 --enforce-eager --tensor-parallel-size 1 --gpu-memory-utilization 0.85
```
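For reference, a minimal client sketch for exercising this server, assuming the default port 8000 and the standard OpenAI-compatible `/v1` route (neither appears in the launch command above, so both are assumptions):

```python
# Minimal client sketch for the server launched above.
# Assumptions: default port 8000, OpenAI-compatible /v1 route, and a model
# name matching --served-model-name.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed default port
    api_key="EMPTY",  # no real key needed unless the server sets --api-key
)

response = client.chat.completions.create(
    model="chat-yzq",  # matches --served-model-name
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```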
When the server receives around 10,000 requests in a day, responses become very slow. Is there anything I can do about this?
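To see whether the slowdown comes from requests queueing up rather than from single-request latency, a rough load-test sketch could help; it assumes the same endpoint and model name as the client sketch above, and the concurrency level and prompt are arbitrary illustration values:

```python
# Rough load-test sketch: fire N concurrent requests and report per-request
# latency, to see how response time grows with concurrency.
# Assumptions: same endpoint/model as above; concurrency=16 is arbitrary.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> float:
    start = time.perf_counter()
    await client.chat.completions.create(
        model="chat-yzq",
        messages=[{"role": "user", "content": f"Request {i}: say hello."}],
        max_tokens=32,
    )
    return time.perf_counter() - start

async def main(concurrency: int = 16) -> None:
    latencies = sorted(await asyncio.gather(
        *(one_request(i) for i in range(concurrency))
    ))
    print(
        f"min {latencies[0]:.2f}s  "
        f"median {latencies[len(latencies) // 2]:.2f}s  "
        f"max {latencies[-1]:.2f}s"
    )

asyncio.run(main())
```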
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.