
[Feature]: Unblock LLM while handling long sequences / Handling multiple prefills at the same time

Open · schoennenbeck opened this issue 2 months ago · 8 comments

🚀 The feature, motivation and pitch

Motivation

If an engine is currently handling a single long sequence in the prefill stage, any other incoming sequence has to wait until the LLM is done with the long one before it gets handled. This means that in a situation with multiple users, a single user's ill-conceived (or simply long) request can easily make the LLM unresponsive for all other users.

Initial ideas

There are a few ways one can currently approach this:

  • Simply accept this fact: we do first come, first served, and people have to wait.
  • Scale up: host multiple replicas, or spread a single replica over multiple GPUs to alleviate this somewhat.
  • Use priority scheduling and give longer requests lower priority (a hedged sketch follows this list).
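
For reference, here is a hedged sketch of the priority-scheduling workaround. The `scheduling_policy` engine argument and the `priority` argument of `generate` are assumptions based on recent vLLM versions (~v0.6.x) and may differ in yours:

```python
from vllm import LLM, SamplingParams

# Assumption: the engine supports a priority policy
# (engine arg scheduling_policy="priority", CLI --scheduling-policy priority).
llm = LLM(model="facebook/opt-125m", scheduling_policy="priority")

short_prompt = "Summarize this sentence."
long_prompt = "lorem ipsum " * 4000  # stand-in for a very long prompt

params = SamplingParams(max_tokens=64)
# Lower values are scheduled first, so the long request gets a larger value.
outputs = llm.generate([short_prompt, long_prompt], params, priority=[0, 10])
```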

However, most of these ideas either come with their own problems or don't actually solve the underlying one.

Suggestion

I don't know of any approach that would work without chunked prefill. However, if chunked prefill is enabled, the following approach could work:

  • Introduce a new engine parameter min_num_concurrent_sequences (with a default of 1, which is just the current behaviour).
  • While scheduling, first schedule decodes (as is currently the case), since each of these only takes a single token from the budget.
  • Spread the remaining token budget over enough chunked prefills so that at least min_num_concurrent_sequences sequences make progress during the next step (or all of them, if fewer sequences are waiting); see the sketch after this list.
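
To make the idea a bit more concrete, here is a minimal Python sketch of the budget split. `PrefillRequest` and `split_prefill_budget` are illustrative names, not part of vLLM's scheduler; a real implementation would also have to respect block allocation, preemption, etc.

```python
from dataclasses import dataclass


@dataclass
class PrefillRequest:
    request_id: str
    remaining_prompt_tokens: int  # prompt tokens not yet prefilled


def split_prefill_budget(
    prefills: list[PrefillRequest],
    token_budget: int,  # max_num_batched_tokens minus tokens already taken by decodes
    min_num_concurrent_sequences: int,
) -> dict[str, int]:
    """Return a prefill chunk size per request so that at least
    min_num_concurrent_sequences prefills make progress this step
    (or all of them, if fewer are waiting)."""
    if not prefills or token_budget <= 0:
        return {}

    num_active = min(len(prefills), max(min_num_concurrent_sequences, 1))
    active = prefills[:num_active]        # assume FCFS order
    per_seq = token_budget // num_active  # even split of the remaining budget

    chunks: dict[str, int] = {}
    leftover = token_budget
    for req in active:
        chunk = min(per_seq, req.remaining_prompt_tokens)
        chunks[req.request_id] = chunk
        leftover -= chunk

    # Hand budget that short sequences couldn't use back to requests that
    # still have prompt tokens left.
    for req in active:
        if leftover <= 0:
            break
        extra = min(leftover, req.remaining_prompt_tokens - chunks[req.request_id])
        chunks[req.request_id] += extra
        leftover -= extra

    return chunks
```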

Example

Say min_num_concurrent_sequences=2 and max_num_batched_tokens=512, and we have two sequences with 8000 and 300 tokens respectively. Then we would do chunked prefill for both sequences with 256 tokens each.
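
Running the sketch above on this example (using its hypothetical names, not vLLM API):

```python
reqs = [
    PrefillRequest("long", remaining_prompt_tokens=8000),
    PrefillRequest("short", remaining_prompt_tokens=300),
]
print(split_prefill_budget(reqs, token_budget=512, min_num_concurrent_sequences=2))
# -> {'long': 256, 'short': 256}
```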

Expected result

Implementing this would mean that no single user could block other users from getting their answers in a timely manner. Clearly the long sequence would now take longer to be handled, but requests would be treated a little more fairly. It is still very much possible to get slow answers when the LLM is under high load, but we could service more users at the same time at the cost of higher inter-token latency (ITL) for each individual user, which I personally think is, in a lot of cases, preferable to one user being serviced quickly while everybody else has to wait.

Call for comments

I am currently trying my hand at a prototype implementation (basically because I need this for my use case), but it is hardly trivial. Any thoughts, comments and suggestions are welcome.

Before submitting a new issue...

  • [X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

schoennenbeck · Nov 29 '24, 10:11