[Usage]: Question Regarding vLLM Rate Limiting
Your current environment
[email protected] vllm==0.6.3 torch==2.4.0
How would you like to use vllm
Hello,
I have a question regarding the VLLM rate limit. I am running the Qwen-2.5 32B model on an A100 80GB * 2 setup. Specifically, I am using VLLM to set up the server and sending several hundred queries asynchronously from the front-end. Each query contains around 4000 to 5000 tokens.
The issue I am encountering is that only a portion of the queries are being processed. Could this be related to the VLLM rate limit? I would appreciate any guidance on the appropriate approach to handle this.
Thank you.
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.