
[Feature]: Integrate `flash-infer` FP8 KV Cache Chunked-Prefill (Append Attention)

jon-chuang opened this issue 1 year ago · 6 comments

🚀 The feature, motivation and pitch

From the new FlashInfer release: https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.1.4
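For context, the FP8 KV-cache path in that release means the cache pages are stored in FP8 alongside a dequantization scale that the append-attention kernel consumes. A minimal PyTorch sketch of the quantization step (illustrative only; the shapes, the per-tensor scale, and the e4m3 dtype choice here are my assumptions, not vLLM's or FlashInfer's actual scheme):

```python
import torch

# Illustrative sketch (not the FlashInfer or vLLM implementation): quantize a
# paged KV cache to FP8 (e4m3) with a per-tensor scale, which is roughly the
# form of input an FP8 KV-cache append-attention kernel expects.
# The shapes (num_pages, page_size, num_kv_heads, head_dim) are assumptions.

def quantize_kv_fp8(kv: torch.Tensor):
    """Return (fp8_kv, scale) such that fp8_kv * scale approximates kv."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = kv.abs().amax().clamp(min=1e-6) / fp8_max
    fp8_kv = (kv / scale).to(torch.float8_e4m3fn)
    return fp8_kv, scale

num_pages, page_size, num_kv_heads, head_dim = 8, 16, 4, 128
k_cache = torch.randn(num_pages, page_size, num_kv_heads, head_dim,
                      dtype=torch.float16)
k_fp8, k_scale = quantize_kv_fp8(k_cache)
# The attention kernel would then dequantize on the fly: k ≈ k_fp8.float() * k_scale
```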

cc @comaniac

Additional context

Follow up to: https://github.com/vllm-project/vllm/pull/7208, https://github.com/vllm-project/vllm/pull/7185

jon-chuang · Aug 13 '24

Actually, @comaniac, I noticed that there are explicit asserts forbidding the use of FlashInfer kernels for chunked prefill: https://github.com/vllm-project/vllm/blob/774cd1d3bf7890c6abae6c7ace798c4a376b2b20/vllm/attention/backends/flashinfer.py#L195

As pointed out in: https://github.com/flashinfer-ai/flashinfer/issues/392#issuecomment-2246997216
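For reference, the guard is roughly of this shape (a paraphrased sketch, not the exact vLLM code; the class and field names are illustrative):

```python
# Paraphrased sketch of the guard referenced above (not the exact vLLM code;
# names are illustrative). The FlashInfer backend rejects batches that mix
# prefill and decode tokens, which is exactly what chunked prefill produces.
class FlashInferMetadata:
    def __init__(self, num_prefill_tokens: int, num_decode_tokens: int):
        self.num_prefill_tokens = num_prefill_tokens
        self.num_decode_tokens = num_decode_tokens
        if num_prefill_tokens > 0:
            assert num_decode_tokens == 0, (
                "Chunked prefill is not supported with FlashInfer yet.")
```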


My understanding is that this is because vLLM, by default, runs prefill and decode in two separate kernel invocations (as is the case for flash-attention; see https://github.com/vllm-project/vllm/pull/6052). Does the same apply to FlashInfer?

Perhaps the first step is to unify the FlashInfer path to use a single kernel, similar to https://github.com/vllm-project/vllm/pull/6052, or at least to clarify in which scenarios it is OK to run FlashInfer kernels for chunked prefill. According to @yzh119 in https://github.com/flashinfer-ai/flashinfer/issues/392, FlashInfer should already support this.
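To make the "single kernel" point concrete, here is a naive PyTorch reference of the append-attention semantics involved (illustrative only; this is neither the FlashInfer kernel nor vLLM's code, and the shapes are assumptions):

```python
import torch

# Naive reference of "append attention", the operation chunked prefill needs.
# Each new query token attends causally to all previously cached KV plus the
# KV of its own chunk; decode is simply the chunk_len == 1 special case, which
# is why one unified kernel call can in principle serve both prefill chunks
# and decodes.

def append_attention(q_chunk, k_cached, v_cached, k_chunk, v_chunk):
    # q_chunk:            [chunk_len, num_heads, head_dim]
    # k_cached, v_cached: [cached_len, num_heads, head_dim]
    # k_chunk, v_chunk:   [chunk_len, num_heads, head_dim]
    k = torch.cat([k_cached, k_chunk], dim=0)
    v = torch.cat([v_cached, v_chunk], dim=0)
    cached_len, chunk_len = k_cached.shape[0], q_chunk.shape[0]
    scale = q_chunk.shape[-1] ** -0.5
    # scores: [num_heads, chunk_len, cached_len + chunk_len]
    scores = torch.einsum("qhd,khd->hqk", q_chunk, k) * scale
    # Causal mask: query i (absolute position cached_len + i) may only attend
    # to keys at absolute positions <= cached_len + i.
    q_pos = cached_len + torch.arange(chunk_len).unsqueeze(1)   # [chunk_len, 1]
    k_pos = torch.arange(cached_len + chunk_len).unsqueeze(0)   # [1, total_len]
    scores = scores.masked_fill(k_pos > q_pos, float("-inf"))
    probs = scores.softmax(dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v)                # [chunk_len, num_heads, head_dim]
```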

jon-chuang · Aug 13 '24

Anyway, please assign it to me; I will investigate further.

jon-chuang · Aug 13 '24

We are already working on this. cc @Yard1

comaniac · Aug 13 '24

@comaniac Any updates or open PRs on this that we can take a look at?

pavanimajety · Sep 05 '24

@comaniac Any updates?

taegeonum · Nov 20 '24

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] · Feb 19 '25

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] · Mar 21 '25