
[Feature]: Implement Priority Scheduling In V1 Engine

Open robertgshaw2-redhat opened this issue 9 months ago • 6 comments

🚀 The feature, motivation and pitch

In V0, we support request priority. I would like to see this supported in V1 as well.

cc @WoosukKwon

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

robertgshaw2-redhat avatar Feb 28 '25 01:02 robertgshaw2-redhat

I’ll take this and start working on it. :)

Chen-0210 avatar Mar 03 '25 01:03 Chen-0210

I have come up with two possible approaches. I would really appreciate guidance and feedback on the best way to proceed. @robertgshaw2-redhat cc @WoosukKwon

Option 1: Swap after scheduling RUNNING

  • Logic:
    After scheduling RUNNING requests, skip the original WAITING queue logic.
    Instead, swap the highest-priority request from WAITING with the lowest-priority request in RUNNING.
  • Pros:
    • Simple and easy implementation.
  • Cons:
    • Potential performance overhead due to repeated scheduling of requests already in the RUNNING queue.

[Image: Option 1 diagram]
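A minimal sketch of what Option 1 might look like, assuming the scheduler keeps plain `running` and `waiting` lists of request objects with a `priority` field (lower value = more urgent) and an arrival counter for FCFS tie-breaking. All names here are illustrative, not vLLM's actual scheduler API:

```python
from dataclasses import dataclass, field
from itertools import count

_ticket = count()  # global arrival counter for FCFS tie-breaking

@dataclass(order=True)
class Request:
    # Compared fields: (priority, arrival). Lower priority value = more urgent.
    priority: int
    arrival: int = field(default_factory=lambda: next(_ticket))
    req_id: str = field(compare=False, default="")

def swap_if_needed(running: list, waiting: list) -> None:
    """After scheduling RUNNING, preempt the lowest-priority RUNNING
    request if the highest-priority WAITING request outranks it."""
    if not running or not waiting:
        return
    best_waiting = min(waiting)    # highest-priority request in WAITING
    worst_running = max(running)   # lowest-priority request in RUNNING
    if best_waiting < worst_running:
        running.remove(worst_running)
        waiting.remove(best_waiting)
        running.append(best_waiting)    # newcomer gets scheduled
        waiting.append(worst_running)   # preempted request re-queues
```

The cost concern above shows up here: this swap runs every scheduling step, on top of re-scheduling requests that are already in RUNNING.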

Option 2: Merged queues before scheduling

  • Logic:
    Completely skip original logic.
    Merge RUNNING and WAITING queues first, then schedule based on priority.
  • Pros:
    • Avoids repeated scheduling of RUNNING queue requests.
  • Cons:
    • More complex code.

[Image: Option 2 diagram]
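A minimal sketch of Option 2, under the same illustrative assumptions (these names and the `budget` cap are not vLLM's real API): merge the two queues and re-select the running set purely by (priority, arrival time).

```python
from collections import namedtuple

# Illustrative request record: lower priority value = more urgent.
Request = namedtuple("Request", ["req_id", "priority", "arrival"])

def schedule(running, waiting, budget):
    """Merge RUNNING and WAITING, then take the top `budget` requests
    by (priority, arrival); everything else becomes the new WAITING."""
    pool = sorted(running + waiting, key=lambda r: (r.priority, r.arrival))
    return pool[:budget], pool[budget:]  # (new RUNNING, new WAITING)
```

This avoids the repeated per-request swap logic of Option 1, at the cost of re-sorting the merged pool; the real complexity would come from re-plumbing the existing per-queue logic (KV cache, LoRA, encoder inputs) through the merged path.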

Chen-0210 avatar Mar 11 '25 03:03 Chen-0210

Seems like @Chen-0210 has not finished designing this. I will try to come up with an API design for this within the month.

Hanchenli avatar Apr 01 '25 22:04 Hanchenli

@Chen-0210 @Hanchenli I am also interested in this feature. Do you want to work together on it?

xavier-h-10 avatar Apr 11 '25 16:04 xavier-h-10

@xavier-h-10 I'll send a design doc on what to change before next Tuesday, and maybe we can discuss from there? In my opinion it should be something similar to Option 2 that Chen proposed. This resembles the V0 priority scheduling design, where requests are swapped between the running queue and the waiting queue based on priority. It could potentially be done by adding a function inside the main schedule function (not sure about that; I need to think it through).

Hanchenli avatar Apr 11 '25 16:04 Hanchenli

@xavier-h-10 @Hanchenli Sure, let’s discuss this further. I think Option 1 is closer to the V0 implementation? Both options might lead to quite a bit of duplicated code related to LoRA, encoder, and others. Since @WoosukKwon is currently refactoring the scheduling module, I think it would be better to align with the new interface he’s working on.

Chen-0210 avatar Apr 12 '25 12:04 Chen-0210

@Chen-0210 @Hanchenli Thank you for your replies! Shall we sync with @WoosukKwon next week and discuss which option is more feasible?

xavier-h-10 avatar Apr 13 '25 19:04 xavier-h-10

I am thinking of scheduling by (1) request priority, then (2) FCFS when priorities are equal. I have two designs in mind.

The first is changing Scheduler.running into a priority queue with request priority as the primary key and arrival time as the secondary key. This gives flexibility but might add overhead at scheduling time.
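A minimal sketch of this first design using `heapq`, assuming requests can be represented as `(priority, arrival, req_id)` tuples; the class and method names are illustrative, not vLLM's actual code:

```python
import heapq

class PriorityRunningQueue:
    """Scheduler.running as a heap keyed by (priority, arrival time).

    Lower priority value = more urgent; arrival time breaks ties (FCFS).
    """

    def __init__(self):
        self._heap = []  # entries: (priority, arrival, req_id)

    def add(self, priority, arrival, req_id):
        heapq.heappush(self._heap, (priority, arrival, req_id))

    def pop_next(self):
        """Return the next request id: highest priority first,
        FCFS among requests of equal priority."""
        return heapq.heappop(self._heap)[2]
```

Push and pop are O(log n), which is the "more cost during scheduling time" mentioned above compared to appending to and iterating a plain FCFS list.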

The second is changing Scheduler.running into a list of lists, where running[i] is a FCFS list for requests with priority i. This is less elegant and less extensible to me, but when there is only one priority level (the mainstream vLLM use case), we can always schedule from running[0], which is almost identical to the current design.
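A minimal sketch of this second design, again with illustrative names only: one FCFS deque per priority level, scanned from the highest-priority bucket down.

```python
from collections import deque

class BucketedRunningQueue:
    """Scheduler.running as a list of FCFS deques, one per priority level.

    With num_priorities=1 everything lands in running[0] and this degrades
    to the current single FCFS queue.
    """

    def __init__(self, num_priorities=1):
        self.running = [deque() for _ in range(num_priorities)]

    def add(self, priority, req_id):
        self.running[priority].append(req_id)  # FCFS within a level

    def pop_next(self):
        for bucket in self.running:  # scan from priority 0 (most urgent)
            if bucket:
                return bucket.popleft()
        return None
```

The trade-off matches the comment above: O(1) operations in the common single-priority case, but the number of priority levels must be fixed up front, which is what makes it less extensible than the heap.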

I personally prefer the second option. Would like to see what you guys think.

Hanchenli avatar Apr 24 '25 20:04 Hanchenli

I've submitted a PR

amitm02 avatar May 26 '25 07:05 amitm02

> I've submitted a PR

re-submitted as https://github.com/vllm-project/vllm/pull/19057

amitm02 avatar Jun 03 '25 07:06 amitm02