vllm
[Feature]: Implement Priority Scheduling In V1 Engine
🚀 The feature, motivation and pitch
In V0, we support request priority. I would like to see this in V1 as well.
cc @WoosukKwon
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
I’ll take this and start working on it. :)
I have come up with two possible approaches. I would really appreciate guidance and feedback on the best way to proceed. @robertgshaw2-redhat cc @WoosukKwon
Option 1: Swap after scheduling RUNNING
- Logic:
  After scheduling `RUNNING` requests, skip the original `WAITING` queue logic. Instead, swap the highest-priority request from `WAITING` with the lowest-priority request in `RUNNING`.
- Pros:
  - Simple and easy to implement.
- Cons:
  - Potential performance overhead due to repeated scheduling of requests already in the `RUNNING` queue.
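A minimal sketch of what Option 1 could look like. All names here (`Request`, `maybe_preempt`, the queue lists) are illustrative stand-ins, not vLLM's actual scheduler API; it assumes a lower priority value means more urgent.

```python
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    priority: int                            # lower value = more urgent
    arrival_time: float
    request_id: str = field(compare=False)   # excluded from ordering

def maybe_preempt(running: list[Request], waiting: list[Request]) -> None:
    """Swap when the best WAITING request outranks the worst RUNNING one."""
    if not running or not waiting:
        return
    best_waiting = min(waiting)    # highest-priority waiting request
    worst_running = max(running)   # lowest-priority running request
    if best_waiting < worst_running:
        waiting.remove(best_waiting)
        running.remove(worst_running)
        running.append(best_waiting)   # newly scheduled
        waiting.append(worst_running)  # preempted back to WAITING
```

This keeps the existing `RUNNING` scheduling pass untouched and only bolts a swap step onto the end, which is why the option is simple but can re-schedule requests that were already running.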
Option 2: Merged queues before scheduling
- Logic:
  Completely skip the original logic. Merge the `RUNNING` and `WAITING` queues first, then schedule based on priority.
- Pros:
  - Avoids repeated scheduling of `RUNNING` queue requests.
- Cons:
  - More complex code.
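A minimal sketch of Option 2 under the same assumptions (hypothetical `Req` type and `schedule_merged` helper, lower priority value = more urgent; `budget` stands in for whatever token/sequence budget the real scheduler enforces):

```python
from dataclasses import dataclass

@dataclass
class Req:
    request_id: str
    priority: int        # lower value = more urgent
    arrival_time: float  # FCFS tie-breaker within a priority level

def schedule_merged(running: list[Req], waiting: list[Req],
                    budget: int) -> tuple[list[Req], list[Req]]:
    """Merge both queues, keep the top `budget` by (priority, arrival)."""
    merged = sorted(running + waiting,
                    key=lambda r: (r.priority, r.arrival_time))
    return merged[:budget], merged[budget:]
```

Because the two queues are ranked together in a single pass, a running request is never scheduled twice, but the real implementation would also have to handle preemption side effects (KV-cache eviction, LoRA/encoder state) for requests that drop out of the scheduled set.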
It seems @Chen-0210 has not finished designing this. I will try to come up with an API design for it this month.
@Chen-0210 @Hanchenli I am also interested in this feature. Do you want to work together on it?
@xavier-h-10 I'll send a design doc on what to change before next Tuesday, and maybe we can discuss from there? In my opinion it should be something similar to option 2 that Chen proposed. This is similar to the V0 priority scheduling design, where requests are swapped between the running queue and the waiting queue based on priority. This could potentially be done by adding a function inside the main schedule function (not sure about that; I need to think it over).
@xavier-h-10 @Hanchenli Sure, let’s discuss this further. I think Option 1 is closer to the V0 implementation? Both options might lead to quite a bit of duplicated code related to LoRA, encoder, and others. Since @WoosukKwon is currently refactoring the scheduling module, I think it would be better to align with the new interface he’s working on.
@Chen-0210 @Hanchenli Thank you for your reply! Let's sync with @WoosukKwon and discuss which option is more feasible next week?
I am thinking of scheduling by (1) request priority, then (2) FCFS when priorities are equal. I have two designs in mind.
The first is to change `Scheduler.running` into a priority queue with request priority as the primary key and arrival time as the secondary key. This gives flexibility but might incur more cost at scheduling time.
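The first design could be sketched with `heapq`, roughly as below. The `push`/`pop` helpers are hypothetical, not vLLM code; the monotonic counter is there only to break ties deterministically when two requests share both priority and arrival time.

```python
import heapq
import itertools

_counter = itertools.count()  # deterministic tie-breaker

def push(heap: list, priority: int, arrival_time: float,
         request_id: str) -> None:
    """Insert keyed on (priority, arrival_time); lower sorts first."""
    heapq.heappush(heap, (priority, arrival_time, next(_counter), request_id))

def pop(heap: list) -> str:
    """Pop the highest-priority request (FCFS within equal priority)."""
    _priority, _arrival, _tie, request_id = heapq.heappop(heap)
    return request_id
```

Push and pop are O(log n), which is the extra scheduling-time cost mentioned above compared to appending to a plain list.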
The second is to change `Scheduler.running` into a list of lists, where `running[i]` is a FCFS list for requests with priority `i`. This is less elegant and less extensible to me, but when there is only one priority level (the mainstream use case of vLLM), we could just always schedule from `running[0]`, and this would be almost the same as the current design.
I personally prefer the second option. Would like to see what you guys think.
I've submitted a PR