[Usage]: Question Regarding vLLM Rate Limiting
Your current environment
[email protected] vllm==0.6.3 torch==2.4.0
How would you like to use vllm
Hello,
I have a question regarding the VLLM rate limit. I am running the Qwen-2.5 32B model on an A100 80GB * 2 setup. Specifically, I am using VLLM to set up the server and sending several hundred queries asynchronously from the front-end. Each query contains around 4000 to 5000 tokens.
The issue I am encountering is that only a portion of the queries are being processed. Could this be related to the VLLM rate limit? I would appreciate any guidance on the appropriate approach to handle this.
Thank you.
Before submitting a new issue...
- [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.