[FEATURE] Implement Dynamic SplitFuse
Dear vLLM maintainers @WoosukKwon and @zhuohan123 (@Yard1),
DeepSpeed has released its serving framework, which claims to be faster than vLLM. The main speedup comes from Dynamic SplitFuse, a technique that does the following (a rough scheduling sketch is included after the links below):
- Long prompts are decomposed into much smaller chunks and scheduled across multiple forward passes (iterations), with only the final pass performing any generation.
- Short prompts are composed to exactly fill a target token budget. Even short prompts may be decomposed to ensure the budget is precisely met and the forward sizes are well aligned.
Code: https://github.com/microsoft/DeepSpeed-MII
Background: https://github.com/microsoft/DeepSpeed/tree/master/blogs/deepspeed-fastgen
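To make the idea concrete, here is a minimal, self-contained sketch of a SplitFuse-style token-budget scheduler. It is not the DeepSpeed-MII or vLLM implementation; the `Request` class, the `schedule_step` function, and the 2048-token budget are all illustrative assumptions.

```python
# Minimal, illustrative sketch of a SplitFuse-style token-budget scheduler.
# Everything here (Request, schedule_step, the 2048-token budget) is an
# assumption for illustration, not DeepSpeed-MII or vLLM code.
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    prompt_tokens: list[int]
    prefill_pos: int = 0  # index of the next prompt token still needing prefill

    @property
    def remaining_prefill(self) -> int:
        return len(self.prompt_tokens) - self.prefill_pos


def schedule_step(waiting: list[Request], token_budget: int) -> list[tuple[Request, int]]:
    """Pick (request, chunk_size) pairs whose chunks fill the token budget.

    A long prompt contributes at most `token_budget` tokens and resumes on a
    later step; short prompts are packed together, and the last one in the
    batch may itself be split so every step stays close to the budget.
    """
    batch: list[tuple[Request, int]] = []
    remaining = token_budget
    for req in waiting:
        if remaining == 0:
            break
        if req.remaining_prefill == 0:
            continue  # prefill finished; this request is in the decode phase
        chunk = min(req.remaining_prefill, remaining)
        req.prefill_pos += chunk
        batch.append((req, chunk))
        remaining -= chunk
    return batch


if __name__ == "__main__":
    requests = [
        Request("long", list(range(5000))),    # spread over multiple steps
        Request("short-a", list(range(100))),  # packed alongside the long prompt's tail
        Request("short-b", list(range(300))),
    ]
    step = 0
    while any(r.remaining_prefill for r in requests):
        batch = schedule_step(requests, token_budget=2048)
        print(f"step {step}:", [(r.request_id, n) for r, n in batch])
        step += 1
```

With this toy queue, the first two steps process exactly 2048 prompt tokens each, and the final step handles the leftover 1304 (904 from the long prompt plus both short prompts), which is the "well-aligned forward sizes" behavior described above.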
Llama 13B (1x A100-80GB): [benchmark figure]
Llama 70B (4x A100-80GB with TP): [benchmark figure]
LGTM
Hi, is there any progress right now?
Do we have an ETA? 😊
Hi @WoosukKwon @zhuohan123
The absence of a chunked prefill implementation in vLLM is a major blocker for us. Any kind of timeline or regular communication on progress toward it would be immensely helpful, just to allow for future planning.
Keeping batches at an aligned length definitely helps: https://github.com/vllm-project/vllm/pull/2357
Looks like someone has started working on this: https://github.com/vllm-project/vllm/pull/3106
Chunked prefill is now supported.
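For anyone landing here later, here is a minimal usage sketch of how chunked prefill is typically switched on through vLLM's offline API. The argument names (`enable_chunked_prefill`, `max_num_batched_tokens`) and their defaults may differ between releases, so treat this as an assumption and check the docs for your installed version.

```python
# Hedged usage sketch: enabling chunked prefill via vLLM's offline LLM API.
# Argument names and defaults may vary by version; verify against your install.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",
    enable_chunked_prefill=True,    # split long prefills across engine steps
    max_num_batched_tokens=2048,    # per-step token budget shared by prefill and decode
)

outputs = llm.generate(
    ["A very long prompt ..."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```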