[Tracking issue] [Help wanted]: Multi-step scheduling follow-ups
Co-authored with @SolitaryThinker @Yard1 @rkooo567
We are landing multi-step scheduling (#7000) to amortize scheduling overhead for better ITL and throughput. Since the first version of multi-step scheduling doesn't work with some existing features, this issue tracks the work needed to support them so that multi-step scheduling can become a common and practical feature in vLLM.
Performance
Chunked Prefill
It is tricky for multi-step scheduling to work with chunked prefill for the following reasons:
- Chunked prefill schedules prefill and decode requests to the same batch.
- Prefill requests only need a few steps (at most `prompt_tokens / chunk_size` steps, e.g., a 2048-token prompt with a chunk size of 512 needs at most 4 steps), which could be much fewer than the configured number of multi-steps (e.g., 8).
- We cannot turn a prefill request into a decode request without re-scheduling and re-preparing inputs.
As a result, we need a scheduling policy to deal with prefill requests in multi-step scheduling. Here are two possible policies we could consider at the moment:
- Force Single Step: Force a single step whenever there are prefill requests in a batch. This may work well for offline batching, but not for online serving because new requests keep coming.
- Ignore Prefill: Ignore prefill requests from the second step onward, meaning that prefill requests do nothing for the remaining (k-1) steps. This may work better for online serving.
Since there's no single scheduling policy that works for all scenarios, it's better to implement both approaches and let users configure which one to use. We may also come up with better policies in the future, so these policies should be pluggable; a minimal sketch of such an interface is shown below.
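As an illustration only, here is a minimal sketch of what a pluggable policy interface could look like. The names (`PrefillPolicy`, `ForceSingleStepPolicy`, `IgnorePrefillPolicy`, `num_steps_for_batch`) are hypothetical and not part of the vLLM API.

```python
# Hypothetical sketch of a pluggable prefill-handling policy for multi-step
# scheduling. None of these names exist in vLLM; they only illustrate the idea.
from abc import ABC, abstractmethod


class PrefillPolicy(ABC):
    """Decides how many lookahead steps to run for a batch that may
    contain prefill requests."""

    @abstractmethod
    def num_steps_for_batch(self, has_prefill: bool, configured_steps: int) -> int:
        ...


class ForceSingleStepPolicy(PrefillPolicy):
    """Fall back to a single step whenever the batch contains prefills
    (simple, but less suitable for online serving where new requests
    keep arriving)."""

    def num_steps_for_batch(self, has_prefill: bool, configured_steps: int) -> int:
        return 1 if has_prefill else configured_steps


class IgnorePrefillPolicy(PrefillPolicy):
    """Always run the configured number of steps; prefill requests simply
    idle for the remaining (k-1) steps after their first step."""

    def num_steps_for_batch(self, has_prefill: bool, configured_steps: int) -> int:
        return configured_steps
```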
The action items are:
- [ ] An interface / API to configure policies. @varun-sundar-rabindranath #8378
- [ ] Force single step policy. @varun-sundar-rabindranath #8378
- [ ] Ignore prefill policy. @varun-sundar-rabindranath #8378
- [ ] (For long context) Support multi-step chunked prefill.
Misc
- [x] remove num_steps argument https://github.com/vllm-project/vllm/pull/7000/files#r1718684239 (@SolitaryThinker )
- [x] Double check if last_sampled_token_ids can be removed (@SolitaryThinker )
- [ ] ADAG / SPMD integration
Functionality
- [x] ~~Support prefix caching (should work out of the box but just need to confirm) @comaniac~~
- [x] Support LLM engine (non-async) @alexm-neuralmagic #7789
- [x] Support abort requests (#7877).
- [ ] Early stopping: if a large fraction of the requests in a batch reach EOS or the max model length before the end of the n steps, stop early (see the sketch after this list).
- [ ] Streaming output tokens incrementally.
- [x] Support logprobs (in `_pythonize_sampler_output`) @afeldman-nm #7652
- [ ] Support prompt logprobs. @afeldman-nm #8199
- [ ] Support guided decoding (+logits processors).
- [ ] Support speculative decoding.
- [ ] Support LoRA
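As a rough illustration of the early-stopping item above, the check could look something like the sketch below. `Request`, `finished`, and `early_stop_threshold` are made-up names for this example, not vLLM APIs.

```python
# Hypothetical sketch of the early-stopping check: break out of the multi-step
# loop once a large enough fraction of the batch has finished (EOS or max
# model length), instead of running all n steps.
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    hit_eos: bool
    num_tokens: int
    max_model_len: int

    @property
    def finished(self) -> bool:
        return self.hit_eos or self.num_tokens >= self.max_model_len


def should_stop_early(batch: List[Request], early_stop_threshold: float = 0.5) -> bool:
    """Return True if the fraction of finished requests in the batch reaches
    the threshold, so the remaining lookahead steps can be skipped."""
    if not batch:
        return True
    num_finished = sum(1 for req in batch if req.finished)
    return num_finished / len(batch) >= early_stop_threshold
```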
Thanks Cody!
cc @tlrmchlsmth @alexm-neuralmagic @afeldman-nm @varun-sundar-rabindranath
Additions for tracking. I will take up both of these. cc @zhuohan123
- [ ] remove num_steps argument https://github.com/vllm-project/vllm/pull/7000/files#r1718684239 (@SolitaryThinker )
- [ ] Double check if last_sampled_token_ids can be removed (@SolitaryThinker ) https://github.com/vllm-project/vllm/pull/7715
I think we can also try making it work with the new SPMD architecture, which can simplify the code and improve performance, especially for pipeline parallelism.
- [ ] ADAG / SPMD integration
- [ ] LoRA support
Is multi-step scheduling not supported with LoRA at all? Does that mean any LoRA requests that come in do not use multi-step scheduling? cc @SolitaryThinker
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!
Closing as completed since the work was done.