[FEATURE] Implement Dynamic SplitFuse
Dear vLLM maintainers @WoosukKwon and @zhuohan123 (@Yard1),
DeepSpeed has released its serving framework, which claims to be faster than vLLM. The main speedup comes from Dynamic SplitFuse, a technique that does the following (a rough scheduling sketch is included after the links below):
- Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations), with only the final pass performing any generation.
- Short prompts are composed to exactly fill a target token budget. Even short prompts may be decomposed to ensure the budget is precisely met and the forward sizes are well aligned.
Code: https://github.com/microsoft/DeepSpeed-MII
Background: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen
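To make the idea concrete, here is a minimal, self-contained sketch of a SplitFuse-style token-budget scheduler. It is not the DeepSpeed-MII or vLLM implementation; the `Request` class, the `schedule_step` function, and the 2048-token budget are all illustrative assumptions.

```python
# Minimal, illustrative sketch of a SplitFuse-style token-budget scheduler.
# Everything here (Request, schedule_step, the 2048-token budget) is an
# assumption for illustration, not DeepSpeed-MII or vLLM code.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: list[int]
    prefill_pos: int = 0  # index of the next prompt token still needing prefill

    @property
    def remaining_prefill(self) -> int:
        return len(self.prompt_tokens) - self.prefill_pos


def schedule_step(waiting: list[Request], token_budget: int) -> list[tuple[Request, int]]:
    """Pick (request, chunk_size) pairs whose chunks fill the token budget.

    A long prompt contributes at most `token_budget` tokens and resumes on a
    later step; short prompts are packed together, and the last one in the
    batch may itself be split so every step stays close to the budget.
    """
    batch: list[tuple[Request, int]] = []
    remaining = token_budget
    for req in waiting:
        if remaining == 0:
            break
        if req.remaining_prefill == 0:
            continue  # prefill finished; this request is in the decode phase
        chunk = min(req.remaining_prefill, remaining)
        req.prefill_pos += chunk
        batch.append((req, chunk))
        remaining -= chunk
    return batch


if __name__ == "__main__":
    requests = [
        Request("long", list(range(5000))),    # spread over multiple steps
        Request("short-a", list(range(100))),  # packed alongside the long prompt's tail
        Request("short-b", list(range(300))),
    ]
    step = 0
    while any(r.remaining_prefill for r in requests):
        batch = schedule_step(requests, token_budget=2048)
        print(f"step {step}:", [(r.request_id, n) for r, n in batch])
        step += 1
```

With this toy queue, the first two steps process exactly 2048 prompt tokens each, and the final step handles the leftover 1304 (904 from the long prompt plus both short prompts), which is the "well-aligned forward sizes" behavior described above.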
Llama 13B (1x A100-80GB): [benchmark figure]
Llama 70B (4x A100-80GB with TP): [benchmark figure]
LGTM
Hi, is there any progress right now?
Do we have an ETA? 😊
Hi @WoosukKwon @zhuohan123
The absence of a chunked prefill implementation in vLLM is a major blocker for us. Any kind of timeline or regular communication on progress toward it would be immensely helpful, just to allow for future planning.
Keeping batches at an aligned length definitely helps: https://github.com/vllm-project/vllm/pull/2357
Looks like someone has started working on this: https://github.com/vllm-project/vllm/pull/3106
Chunked prefill is now supported.
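For anyone landing here later, here is a minimal usage sketch of how chunked prefill is typically switched on through vLLM's offline API. The argument names (`enable_chunked_prefill`, `max_num_batched_tokens`) and their defaults may differ between releases, so treat this as an assumption and check the docs for your installed version.

```python
# Hedged usage sketch: enabling chunked prefill via vLLM's offline LLM API.
# Argument names and defaults may vary by version; verify against your install.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    enable_chunked_prefill=True,    # split long prefills across engine steps
    max_num_batched_tokens=2048,    # per-step token budget shared by prefill and decode
)

outputs = llm.generate(
    ["A very long prompt ..."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```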