[Misc]: Server Does Not Follow Scheduler Policy
I was testing out vLLM on Colab and noticed something weird. From the code, it seems that vLLM uses a first-come-first-serve (FCFS) scheduling policy:
https://github.com/vllm-project/vllm/blob/7038e8b80303bf6128acbe508dec910183a1be56/vllm/core/scheduler.py#L729 https://github.com/vllm-project/vllm/blob/7038e8b80303bf6128acbe508dec910183a1be56/vllm/core/policy.py#L29-L36
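For reference, the linked FCFS policy essentially sorts the waiting queue so that the oldest request is served first. Here is a simplified, self-contained sketch of that ordering; the `WaitingRequest` class and the function names below are illustrative stand-ins, not vLLM's actual types:

```python
# Simplified sketch of FCFS ordering: waiting requests are sorted so that
# the request that has waited longest (earliest arrival) is scheduled first.
# WaitingRequest is an illustrative stand-in for vLLM's SequenceGroup.
import time
from dataclasses import dataclass


@dataclass
class WaitingRequest:
    request_id: str
    arrival_time: float  # seconds since epoch, recorded when the request arrived


def fcfs_priority(now: float, req: WaitingRequest) -> float:
    # Larger waiting time -> higher priority.
    return now - req.arrival_time


def sort_by_priority(now: float, queue: list[WaitingRequest]) -> list[WaitingRequest]:
    return sorted(queue, key=lambda r: fcfs_priority(now, r), reverse=True)


if __name__ == "__main__":
    now = time.time()
    queue = [
        WaitingRequest("r2", arrival_time=now - 1.0),
        WaitingRequest("r0", arrival_time=now - 3.0),
        WaitingRequest("r1", arrival_time=now - 2.0),
    ]
    # Prints ['r0', 'r1', 'r2']: strictly first-come-first-serve.
    print([r.request_id for r in sort_by_priority(now, queue)])
```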
However, when I ran the OpenAI-compatible vLLM server and sent in requests in sequence, the server did not seem to follow the first-come-first-serve policy; the outputs came back in what looked like a random order. Here is an example Jupyter notebook replicating the issue: https://colab.research.google.com/drive/1mMPTZiKJoQEsvjBjNUGttsbp9L1F9zXm?usp=sharing
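For concreteness, the pattern can be observed with something along these lines, where requests are issued in arrival order but not awaited one at a time (this is only a sketch, not necessarily what the notebook does; the endpoint, API key, and model name are placeholders):

```python
# Send several completions to a locally running OpenAI-compatible vLLM
# server and print the order in which they finish.
# base_url, api_key, and model are placeholders for this sketch.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request(i: int) -> int:
    await client.completions.create(
        model="facebook/opt-125m",  # placeholder model
        prompt=f"Request {i}: tell me a short story.",
        max_tokens=64,
    )
    return i


async def main() -> None:
    # Requests are created in arrival order 0..7, but they may complete
    # in a different order depending on how many tokens each one generates.
    tasks = [asyncio.create_task(one_request(i)) for i in range(8)]
    for fut in asyncio.as_completed(tasks):
        print("finished request", await fut)


asyncio.run(main())
```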
Is there some optimization I missed that reorders the inputs? I am a bit confused about what controls the server's output order. Any advice would be appreciated, thanks!
The input and output lengths per request also matter. Concretely, as of right now:
- The core logic is in the schedule() function, as follows (existing logic, without prefix caching, chunked prefill, or spec decode):
- We have two scheduling limits. When these limits are reached, no new requests are processed until a sequence group finishes (the corresponding engine arguments are shown in the sketch after this list):
  - Maximum number of batched requests
    - Equivalent to the number of sequences (not sequence groups).
    - This flag is used to control the latency/throughput trade-off: small = low throughput, low latency.
  - Maximum number of batched tokens
    - For prefill, it is the length of the prompt.
    - For decode, it is 1 per sequence, i.e. n per sequence group.
    - This flag is used to control memory.
- Given these limits, the scheduling order is as follows (a pseudocode sketch follows after this list):
  - First, we schedule prefill requests from the waiting queue.
    - If there are any prefill requests, we just run prefill requests first.
    - The rationale: this is optimal for TTFT and maximizes batch size. The downside is high inter-token latency.
  - Second, we schedule all the running requests.
    - If we cannot run some of the requests due to memory constraints, we swap some of the running requests out.
  - Third, we schedule all the swapped-out requests.
(this is subject to change)
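For reference, the two limits above correspond to vLLM's max_num_seqs and max_num_batched_tokens engine arguments. A minimal offline sketch, assuming a small placeholder model and illustrative values:

```python
# The two scheduling limits map to engine arguments:
#   max_num_seqs           -> maximum number of batched requests (sequences) per step
#   max_num_batched_tokens -> maximum number of batched tokens per step
# The model name and the values here are placeholders for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",    # placeholder model
    max_num_seqs=64,              # latency/throughput knob
    max_num_batched_tokens=4096,  # memory knob (must cover the longest prompt)
)

outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```

The same knobs are exposed on the OpenAI-compatible server as --max-num-seqs and --max-num-batched-tokens.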
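To make the ordering concrete, here is a minimal pseudocode sketch of the decision order described in the list above. It is not vLLM's actual schedule() implementation; the Budget/Request structures and the schedule_step function are invented for illustration, and swapping-out on memory pressure is omitted:

```python
# Illustrative sketch of the scheduling order: prefill requests from the
# waiting queue first, then decode for running requests, then swapped-out
# requests. Budget, Request, and schedule_step are made up for this sketch.
from collections import deque
from dataclasses import dataclass


@dataclass
class Budget:
    max_num_seqs: int            # maximum number of batched sequences per step
    max_num_batched_tokens: int  # maximum number of batched tokens per step
    num_seqs: int = 0
    num_tokens: int = 0

    def can_fit(self, seqs: int, tokens: int) -> bool:
        return (self.num_seqs + seqs <= self.max_num_seqs
                and self.num_tokens + tokens <= self.max_num_batched_tokens)

    def add(self, seqs: int, tokens: int) -> None:
        self.num_seqs += seqs
        self.num_tokens += tokens


@dataclass
class Request:
    request_id: str
    prompt_len: int
    num_seqs: int = 1  # n sequences in the sequence group


def schedule_step(waiting: deque, running: list, swapped: deque, budget: Budget):
    scheduled = []
    # 1. Prefill: admit waiting requests first, in FCFS order. A prefill
    #    consumes prompt_len tokens of the token budget. This is good for
    #    TTFT and batch size, at the cost of inter-token latency.
    while waiting and budget.can_fit(waiting[0].num_seqs, waiting[0].prompt_len):
        req = waiting.popleft()
        budget.add(req.num_seqs, req.prompt_len)
        scheduled.append(("prefill", req))
    if scheduled:
        # If any prefill was scheduled, run prefills only this step.
        return scheduled
    # 2. Decode the running requests: 1 token per sequence per step.
    for req in running:
        if budget.can_fit(req.num_seqs, req.num_seqs):
            budget.add(req.num_seqs, req.num_seqs)
            scheduled.append(("decode", req))
    # 3. Finally, try to schedule the swapped-out requests.
    while swapped and budget.can_fit(swapped[0].num_seqs, swapped[0].num_seqs):
        req = swapped.popleft()
        budget.add(req.num_seqs, req.num_seqs)
        scheduled.append(("decode", req))
    return scheduled
```

Because decode advances every running sequence by one token per step, a request that generates a short output finishes in fewer steps than an earlier request with a long output, so responses can come back out of arrival order even though admission is first-come-first-serve.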