
[FEATURE] Implement Dynamic SplitFuse

Open casper-hansen opened this issue 1 year ago • 3 comments

Dear vLLM maintainers @WoosukKwon and @zhuohan123 (@Yard1),

DeepSpeed has released its serving framework which claims to be faster than vLLM. The main speedup comes from Dynamic SplitFuse which is a technique that does the following:

  • Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations) with only the final pass performing any generation.

  • Short prompts are composed to exactly fill a target token budget. Even short prompts may be decomposed so that the budget is precisely met and forward-pass sizes stay well-aligned.
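The scheduling idea above can be sketched in a few lines. This is a minimal, hypothetical illustration (the names `Request` and `schedule_pass` are mine, not the DeepSpeed API): each forward pass is filled up to a fixed token budget by chunking long prompts across passes and packing short prompts together.

```python
# Illustrative sketch of Dynamic SplitFuse-style scheduling.
# Assumption: each request tracks how many prompt tokens still need prefill.

from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    remaining: int  # prompt tokens not yet processed

def schedule_pass(pending: list[Request], budget: int) -> list[tuple[int, int]]:
    """Return (request_id, num_tokens) pairs summing to at most `budget`,
    splitting a long prompt across passes when it exceeds the budget."""
    batch, used = [], 0
    for req in pending:
        if used == budget:
            break
        take = min(req.remaining, budget - used)
        batch.append((req.rid, take))
        req.remaining -= take
        used += take
    return batch

# Example: one 700-token prompt plus two short prompts, 512-token budget.
pending = [Request(0, 700), Request(1, 100), Request(2, 50)]
passes = []
while any(r.remaining for r in pending):
    passes.append(schedule_pass([r for r in pending if r.remaining], 512))
print(passes)
# Pass 1: [(0, 512)] — the long prompt is chunked to fill the budget exactly.
# Pass 2: [(0, 188), (1, 100), (2, 50)] — its remainder is composed with the
# short prompts, keeping every forward pass close to the same size.
```

Keeping every pass at (or near) the same token count is what makes the forward sizes well-aligned, which is where the claimed throughput gain comes from.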

Code: https://github.com/microsoft/DeepSpeed-MII

Background: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen

Llama 13B (1× A100-80GB): [benchmark image]

Llama 70B (4× A100-80GB with TP): [benchmark image]

casper-hansen avatar Nov 04 '23 14:11 casper-hansen

LGTM

irasin avatar Nov 14 '23 13:11 irasin

Hi, is there any progress right now?

thesues avatar Dec 20 '23 02:12 thesues

Do we have an ETA? 😊

shixianc avatar Jan 07 '24 18:01 shixianc

Hi @WoosukKwon @zhuohan123

The absence of a chunked prefill implementation in vLLM is a major blocker. Any kind of timeline, or regular communication on progress toward a chunked prefill implementation, would be immensely helpful for future planning.

tdene avatar Feb 20 '24 06:02 tdene

Keeping batches with aligned lengths definitely helps: https://github.com/vllm-project/vllm/pull/2357

sh1ng avatar Feb 29 '24 22:02 sh1ng

Looks like someone has started working on this: https://github.com/vllm-project/vllm/pull/3106

njhill avatar Feb 29 '24 23:02 njhill

Chunked prefill is now supported

hmellor avatar Jul 26 '24 10:07 hmellor