KV cache optimization with paged attention

liangan1 opened this issue 2 years ago • 9 comments

Feature request

Paged attention has been enabled in many serving engines, e.g., vLLM and TensorRT-LLM.

Motivation

The KV cache is used to reduce computation in the decoder layers, but it also brings memory overhead. For example, with beam search the kv_cache has to be reordered according to the latest beam indices, and the current key/value has to be concatenated with the kv_cache in the attention layer to obtain the entire context for scaled dot-product attention. When the sequence is very long, this memory overhead becomes a performance bottleneck.
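
To make the overhead concrete, here is a minimal PyTorch sketch of the per-step cache handling described above (shapes and names are illustrative, not transformers' actual implementation): the whole cache is gathered along the beam dimension and then grown by concatenation, so memory traffic scales with sequence length.

```python
import torch

def naive_cache_step(past_key, past_value, new_key, new_value, beam_idx):
    """Illustrative per-step cost of a contiguous KV cache with beam search.

    Hypothetical shapes: past_key/past_value are
    (batch * num_beams, num_heads, past_len, head_dim), new_key/new_value are
    (batch * num_beams, num_heads, 1, head_dim), beam_idx is (batch * num_beams,).
    """
    # Reorder the entire cached context to follow the latest beam indices.
    past_key = past_key.index_select(0, beam_idx)
    past_value = past_value.index_select(0, beam_idx)
    # Concatenate the current step's key/value so scaled dot-product attention
    # sees the full context; this copies the whole cache every step.
    key = torch.cat([past_key, new_key], dim=2)
    value = torch.cat([past_value, new_value], dim=2)
    return key, value
```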

Your contribution

No PR yet

liangan1 avatar Nov 06 '23 06:11 liangan1

cc @gante (I think this is closest to your work - sorry if wrong!)

amyeroberts avatar Nov 06 '23 11:11 amyeroberts

@jgong5

liangan1 avatar Nov 07 '23 02:11 liangan1

Hi @liangan1 👋

We are close to introducing a new cache abstraction (https://github.com/huggingface/transformers/pull/26681). I believe that, once that PR is merged, paged attention can be added directly on top of it :)

Would you be interested in adding it to transformers?

gante avatar Nov 07 '23 13:11 gante

Sure. We would be pleased to contribute more kv_cache-related optimizations.

liangan1 avatar Nov 09 '23 10:11 liangan1

Awesome, I will let you know when the cache abstraction is ready!

gante avatar Nov 09 '23 16:11 gante

Thanks.

liangan1 avatar Nov 10 '23 01:11 liangan1

@liangan1 the cache abstraction will be merged today, so you can start working on top of it. Happy to provide pointers and suggestions! 🙌

gante avatar Dec 07 '23 15:12 gante
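
For readers following along, here is a rough, standalone sketch of what a paged KV store built on top of the new cache abstraction could look like. It only mirrors the `update(key_states, value_states, ...)` interface from #26681 (shown per layer; `layer_idx` and `cache_kwargs` are omitted for brevity), and all names are hypothetical rather than an actual transformers API:

```python
import torch

class PagedKVCacheSketch:
    """Illustrative paged KV cache for a single layer (names hypothetical).

    Keys/values are stored in fixed-size blocks allocated on demand, so the
    cache never needs one large contiguous tensor per sequence and never has
    to be fully recopied as the sequence grows.
    """

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.key_blocks: list[torch.Tensor] = []    # each (batch, heads, block_size, head_dim)
        self.value_blocks: list[torch.Tensor] = []
        self.seq_len = 0

    def update(self, key_states: torch.Tensor, value_states: torch.Tensor):
        batch, heads, new_len, head_dim = key_states.shape
        for t in range(new_len):
            slot = self.seq_len % self.block_size
            if slot == 0:
                # Allocate a new fixed-size block only when the last one is full.
                shape = (batch, heads, self.block_size, head_dim)
                self.key_blocks.append(key_states.new_zeros(shape))
                self.value_blocks.append(value_states.new_zeros(shape))
            self.key_blocks[-1][:, :, slot] = key_states[:, :, t]
            self.value_blocks[-1][:, :, slot] = value_states[:, :, t]
            self.seq_len += 1
        # Gather the valid prefix so standard attention code can consume it.
        keys = torch.cat(self.key_blocks, dim=2)[:, :, : self.seq_len]
        values = torch.cat(self.value_blocks, dim=2)[:, :, : self.seq_len]
        return keys, values
```

In a real paged-attention implementation (e.g. vLLM), a block table maps each sequence to its blocks so blocks can be shared across beams and freed independently; the sketch above keeps only the single-sequence bookkeeping.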

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jan 01 '24 08:01 github-actions[bot]

As of the latest release of FlashAttention, v2.5, a paged KV cache is now supported: https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#25-paged-kv-cache. Having this implemented in transformers would be pretty awesome, especially since it can stack with a quantized KV cache, allowing for more than 100k tokens of context on consumer GPUs; with 64 GB of shared memory, roughly 500,000 tokens on a 7B 4-bit model.

NicolasMejiaPetit avatar Jun 28 '24 14:06 NicolasMejiaPetit
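
As a rough sanity check of those numbers, here is a back-of-envelope estimate assuming a Llama-2-7B-like geometry (32 layers, 4096 hidden size) and a 4-bit KV cache, ignoring the model weights and quantization metadata:

```python
# Approximate KV cache size per token for a hypothetical 7B-class model.
num_layers = 32
hidden_size = 4096            # num_heads * head_dim
bytes_per_elem = 0.5          # 4-bit quantized cache
kv_per_token = 2 * num_layers * hidden_size * bytes_per_elem  # K and V

print(kv_per_token / 1024, "KiB per token")          # 128 KiB
for tokens in (100_000, 500_000):
    print(tokens, "tokens ->", tokens * kv_per_token / 2**30, "GiB")
# ~12 GiB for 100k tokens, ~61 GiB for 500k tokens
```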

@gante hi Joao, I am wondering if there are plans to implement better scheduling in GenerationMixin.generate when a large input batch is passed (larger than what can be processed at once), and when some sequences finish earlier than others?

Or, in the near future, should we expect GenerationMixin.generate to keep processing dummy tokens for finished sequences in a batch, rather than attempting to maximize batch size when there are many requests?

fxmarty-amd avatar Feb 19 '25 18:02 fxmarty-amd