[Misc]: Server Does Not Follow Scheduler Policy
I was testing out vLLM on Colab and noticed something weird. From the code, it seems that vLLM uses a first-come-first-serve (FCFS) scheduling policy:
https://github.com/vllm-project/vllm/blob/7038e8b80303bf6128acbe508dec910183a1be56/vllm/core/scheduler.py#L729 https://github.com/vllm-project/vllm/blob/7038e8b80303bf6128acbe508dec910183a1be56/vllm/core/policy.py#L29-L36
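For reference, the linked FCFS policy essentially sorts the waiting queue so that the oldest request is served first. Here is a simplified, self-contained sketch of that ordering; the `WaitingRequest` class and the function names below are illustrative stand-ins, not vLLM's actual types:

```python
# Simplified sketch of FCFS ordering: waiting requests are sorted so that
# the request that has waited longest (earliest arrival) is scheduled first.
# WaitingRequest is an illustrative stand-in for vLLM's SequenceGroup.
import time
from dataclasses import dataclass


@dataclass
class WaitingRequest:
    request_id: str
    arrival_time: float  # seconds since epoch, recorded when the request arrived


def fcfs_priority(now: float, req: WaitingRequest) -> float:
    # Larger waiting time -> higher priority.
    return now - req.arrival_time


def sort_by_priority(now: float, queue: list[WaitingRequest]) -> list[WaitingRequest]:
    return sorted(queue, key=lambda r: fcfs_priority(now, r), reverse=True)


if __name__ == "__main__":
    now = time.time()
    queue = [
        WaitingRequest("r2", arrival_time=now - 1.0),
        WaitingRequest("r0", arrival_time=now - 3.0),
        WaitingRequest("r1", arrival_time=now - 2.0),
    ]
    # Prints ['r0', 'r1', 'r2']: strictly first-come-first-serve.
    print([r.request_id for r in sort_by_priority(now, queue)])
```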
However, when I ran the OpenAI-compatible vLLM server and sent in requests in sequence, the server did not seem to follow the first-come-first-serve policy; the outputs came back in what looked like a random order. Here is an example Jupyter notebook replicating the issue: https://colab.research.google.com/drive/1mMPTZiKJoQEsvjBjNUGttsbp9L1F9zXm?usp=sharing
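For concreteness, the pattern can be observed with something along these lines, where requests are issued in arrival order but not awaited one at a time (this is only a sketch, not necessarily what the notebook does; the endpoint, API key, and model name are placeholders):

```python
# Send several completions to a locally running OpenAI-compatible vLLM
# server and print the order in which they finish.
# base_url, api_key, and model are placeholders for this sketch.
import asyncio

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def one_request(i: int) -> int:
    await client.completions.create(
        model="facebook/opt-125m",  # placeholder model
        prompt=f"Request {i}: tell me a short story.",
        max_tokens=64,
    )
    return i


async def main() -> None:
    # Requests are created in arrival order 0..7, but they may complete
    # in a different order depending on how many tokens each one generates.
    tasks = [asyncio.create_task(one_request(i)) for i in range(8)]
    for fut in asyncio.as_completed(tasks):
        print("finished request", await fut)


asyncio.run(main())
```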
Is there some optimization I missed that reorders the inputs? I am a bit confused about what controls the server's output order. Any advice would be appreciated, thanks!
The input and output lengths per request also matter. Concretely, as of right now:
- The core logic is in the schedule() function, as follows (existing logic, without prefix caching, chunked prefill, or spec decode):
- We have two scheduling limits. When these limits are reached, no new requests are processed until a sequence group finishes (the corresponding engine arguments are shown in the sketch after this list):
  - Maximum number of batched requests
    - Equivalent to the number of sequences (not sequence groups).
    - This flag is used to control the latency/throughput trade-off: small = low throughput, low latency.
  - Maximum number of batched tokens
    - For prefill, it is the length of the prompt.
    - For decode, it is 1 per sequence, i.e. n per sequence group.
    - This flag is used to control memory.
- Given these limits, the scheduling order is as follows (a pseudocode sketch follows after this list):
  - First, we schedule prefill requests from the waiting queue.
    - If there are any prefill requests, we just run prefill requests first.
    - The rationale: this is optimal for TTFT and maximizes batch size. The downside is high inter-token latency.
  - Second, we schedule all the running requests.
    - If we cannot run some of the requests due to memory constraints, we swap some of the running requests out.
  - Third, we schedule all the swapped-out requests.
(this is subject to change)
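For reference, the two limits above correspond to vLLM's max_num_seqs and max_num_batched_tokens engine arguments. A minimal offline sketch, assuming a small placeholder model and illustrative values:

```python
# The two scheduling limits map to engine arguments:
#   max_num_seqs           -> maximum number of batched requests (sequences) per step
#   max_num_batched_tokens -> maximum number of batched tokens per step
# The model name and the values here are placeholders for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",    # placeholder model
    max_num_seqs=64,              # latency/throughput knob
    max_num_batched_tokens=4096,  # memory knob (must cover the longest prompt)
)

outputs = llm.generate(
    ["Hello, my name is", "The capital of France is"],
    SamplingParams(max_tokens=32),
)
for out in outputs:
    print(out.outputs[0].text)
```

The same knobs are exposed on the OpenAI-compatible server as --max-num-seqs and --max-num-batched-tokens.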
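To make the ordering concrete, here is a minimal pseudocode sketch of the decision order described in the list above. It is not vLLM's actual schedule() implementation; the Budget/Request structures and the schedule_step function are invented for illustration, and swapping-out on memory pressure is omitted:

```python
# Illustrative sketch of the scheduling order: prefill requests from the
# waiting queue first, then decode for running requests, then swapped-out
# requests. Budget, Request, and schedule_step are made up for this sketch.
from collections import deque
from dataclasses import dataclass


@dataclass
class Budget:
    max_num_seqs: int            # maximum number of batched sequences per step
    max_num_batched_tokens: int  # maximum number of batched tokens per step
    num_seqs: int = 0
    num_tokens: int = 0

    def can_fit(self, seqs: int, tokens: int) -> bool:
        return (self.num_seqs + seqs <= self.max_num_seqs
                and self.num_tokens + tokens <= self.max_num_batched_tokens)

    def add(self, seqs: int, tokens: int) -> None:
        self.num_seqs += seqs
        self.num_tokens += tokens


@dataclass
class Request:
    request_id: str
    prompt_len: int
    num_seqs: int = 1  # n sequences in the sequence group


def schedule_step(waiting: deque, running: list, swapped: deque, budget: Budget):
    scheduled = []
    # 1. Prefill: admit waiting requests first, in FCFS order. A prefill
    #    consumes prompt_len tokens of the token budget. This is good for
    #    TTFT and batch size, at the cost of inter-token latency.
    while waiting and budget.can_fit(waiting[0].num_seqs, waiting[0].prompt_len):
        req = waiting.popleft()
        budget.add(req.num_seqs, req.prompt_len)
        scheduled.append(("prefill", req))
    if scheduled:
        # If any prefill was scheduled, run prefills only this step.
        return scheduled
    # 2. Decode the running requests: 1 token per sequence per step.
    for req in running:
        if budget.can_fit(req.num_seqs, req.num_seqs):
            budget.add(req.num_seqs, req.num_seqs)
            scheduled.append(("decode", req))
    # 3. Finally, try to schedule the swapped-out requests.
    while swapped and budget.can_fit(swapped[0].num_seqs, swapped[0].num_seqs):
        req = swapped.popleft()
        budget.add(req.num_seqs, req.num_seqs)
        scheduled.append(("decode", req))
    return scheduled
```

Because decode advances every running sequence by one token per step, a request that generates a short output finishes in fewer steps than an earlier request with a long output, so responses can come back out of arrival order even though admission is first-come-first-serve.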