
[Misc]: Server Does Not Follow Scheduler Policy

Open · Bojun-Feng opened this issue 9 months ago · 1 comment

Anything you want to discuss about vllm.

I was testing out vLLM on Colab and noticed something weird. From the code, it seems that vLLM uses a first-come-first-served (FCFS) scheduling policy:

https://github.com/vllm-project/vllm/blob/7038e8b80303bf6128acbe508dec910183a1be56/vllm/core/scheduler.py#L729 https://github.com/vllm-project/vllm/blob/7038e8b80303bf6128acbe508dec910183a1be56/vllm/core/policy.py#L29-L36
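(For context, here is a simplified paraphrase of what that FCFS policy does, not the actual vLLM code: it sorts the waiting queue so that the requests that have been waiting longest come first.)

```python
import time
from dataclasses import dataclass, field
from typing import List


@dataclass
class SeqGroup:
    """Hypothetical stand-in for vLLM's sequence group, just for illustration."""
    request_id: str
    arrival_time: float = field(default_factory=time.monotonic)


def fcfs_sort(now: float, waiting: List[SeqGroup]) -> List[SeqGroup]:
    # FCFS priority: the longer a request has been waiting, the higher its
    # priority, so the queue ends up processed in arrival order.
    return sorted(waiting, key=lambda sg: now - sg.arrival_time, reverse=True)
```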

However, when I was running the OpenAI-compatible vLLM server, I sent requests in sequence and found that the server did not follow the first-come-first-served policy. Instead, the order seemed random. Here is an example Jupyter notebook replicating the issue: https://colab.research.google.com/drive/1mMPTZiKJoQEsvjBjNUGttsbp9L1F9zXm?usp=sharing

Is there some optimization I missed that reorders the inputs? I am a bit confused about what controls the server's output order. Any advice would be appreciated, thanks!
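For reference, a minimal sketch of this kind of test (not the exact notebook code; it assumes a vLLM OpenAI-compatible server at http://localhost:8000/v1, the `openai` Python client, and a placeholder model name):

```python
import asyncio
from openai import AsyncOpenAI

# Assumed setup: a vLLM OpenAI-compatible server already running locally.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

PROMPTS = [f"Request {i}: write a short sentence about the number {i}." for i in range(8)]


async def send(i: int, prompt: str) -> None:
    # Requests are created back to back in a fixed order (0, 1, 2, ...),
    # but completions may be printed in a different order.
    resp = await client.completions.create(
        model="facebook/opt-125m",  # placeholder model name
        prompt=prompt,
        max_tokens=32,
    )
    print(f"finished request {i}: {resp.choices[0].text[:40]!r}")


async def main() -> None:
    await asyncio.gather(*(send(i, p) for i, p in enumerate(PROMPTS)))


asyncio.run(main())
```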

Bojun-Feng avatar May 02 '24 17:05 Bojun-Feng

The input and output lengths per request also matter. Concretely, as of right now:

  • The core logic is in the schedule() function, as follows (the existing logic, without prefix caching, chunked prefill, or spec decode); a rough sketch is included after this list:
    • We have two scheduling limits. When either is reached, no new requests are processed until a sequence group finishes processing.
      • Maximum number of batched requests
        • Equivalent to the number of sequences (not sequence groups).
        • This flag is used to control the latency/throughput trade-off: small = low throughput, low latency.
      • Maximum number of batched tokens
        • For prefill, it will be the length of the prompt
        • For decode, it will be 1 token per sequence, i.e. n per sequence group with n sequences.
        • This flag is used to control memory.
    • First, we schedule prefill requests in the waiting queue.
      • If there is any prefill request, we will just run prefill requests first.
      • The rationale: TTFT optimal, and maximal batch size. Downside is high inter-token latency.
    • Second, we schedule all the running requests
      • If we cannot run some of the requests due to memory constraints, we will swap some of the running requests out.
    • Third, we schedule all the swapped out requests

(this is subject to change)
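A rough sketch of that ordering and the two budgets, using made-up names rather than the real schedule() implementation (swapping running requests out under memory pressure is omitted):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Group:
    """Hypothetical stand-in for a vLLM sequence group."""
    prompt_len: int   # prompt tokens, charged when the group is prefilled
    num_seqs: int     # sequences in the group, each costs 1 token per decode step


def schedule_step(waiting: List[Group], running: List[Group], swapped: List[Group],
                  max_num_seqs: int, max_num_batched_tokens: int) -> List[Group]:
    """One scheduling step: prefills from `waiting` first, then `running`, then `swapped`."""
    batch: List[Group] = []
    num_seqs = num_tokens = 0

    def take_from(queue: List[Group], token_cost: Callable[[Group], int]) -> None:
        nonlocal num_seqs, num_tokens
        while queue:
            group = queue[0]
            cost = token_cost(group)
            # Stop once either budget (batched sequences or batched tokens) is hit;
            # the rest of the queue waits for a later step.
            if (num_seqs + group.num_seqs > max_num_seqs
                    or num_tokens + cost > max_num_batched_tokens):
                break
            queue.pop(0)
            num_seqs += group.num_seqs
            num_tokens += cost
            batch.append(group)

    # 1. If anything is waiting, run prefills and nothing else this step
    #    (good TTFT and large batches, at the cost of inter-token latency).
    if waiting:
        take_from(waiting, lambda g: g.prompt_len)
        return batch

    # 2. Otherwise schedule the running (decoding) requests: 1 token per sequence.
    take_from(running, lambda g: g.num_seqs)
    # 3. Then any previously swapped-out requests.
    take_from(swapped, lambda g: g.num_seqs)
    return batch
```

Even with strictly FCFS admission, completion order then depends on each request's prompt and output lengths, which is why responses can come back "out of order" relative to submission order.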

simon-mo avatar May 02 '24 17:05 simon-mo