
[Feature]: Is there any plan to support Cross-Layer Attention (CLA)?

Open JiayiFeng opened this issue 1 year ago • 4 comments

Cross-Layer Attention (CLA), recently proposed by MIT, can significantly reduce runtime KV cache memory usage. Does vLLM have any plans to support it? Thanks!

Cross-Layer Attention paper: https://arxiv.org/abs/2405.12981
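
For readers skimming the paper, here is a toy PyTorch sketch of the core CLA idea (this is not vLLM code, and every name in it is made up for illustration): only every `sharing_factor`-th layer computes and caches its own K/V, and the layers in between reuse the most recent producer layer's K/V, which is what shrinks the KV cache.

```python
import torch
import torch.nn as nn

class CLABlock(nn.Module):
    """Toy single-head decoder block (no causal mask, for brevity).
    'Producer' layers compute K/V; the other layers reuse the K/V
    tensors from the most recent producer layer."""

    def __init__(self, hidden_size: int, is_kv_producer: bool):
        super().__init__()
        self.is_kv_producer = is_kv_producer
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        if is_kv_producer:
            self.k_proj = nn.Linear(hidden_size, hidden_size)
            self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, shared_kv=None):
        q = self.q_proj(x)
        if self.is_kv_producer:
            k, v = self.k_proj(x), self.v_proj(x)
        else:
            k, v = shared_kv  # reuse K/V produced by an earlier layer
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return self.o_proj(attn @ v), (k, v)

# With sharing_factor=2, layers 0 and 2 produce K/V; layers 1 and 3 reuse it,
# so only half of the layers need their own KV cache at inference time.
sharing_factor = 2
layers = [CLABlock(64, is_kv_producer=(i % sharing_factor == 0)) for i in range(4)]
x, kv = torch.randn(1, 8, 64), None
for layer in layers:
    x, kv = layer(x, shared_kv=kv)
```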

JiayiFeng avatar Jul 10 '24 13:07 JiayiFeng

This method is interesting, and I believe it is quite effective overall (also see https://research.character.ai/optimizing-inference/).

However, it currently requires a model trained for this technique. We would love to see such a model, and then it should be straightforward to add!

simon-mo avatar Jul 11 '24 14:07 simon-mo

@simon-mo Thank you for your reply! It seems that implementing CLA only requires a few modifications to the current vLLM code. I've listed them below, with a rough sketch after the list. I'm not sure whether this will work; is there anything I missed? Thanks!

  1. When calculating cache_block_size, take sharing_factor into consideration: https://github.com/vllm-project/vllm/blob/f7160d946a0a07703e72d81ba9ecf3913f192605/vllm/worker/cache_engine.py#L117

  2. Change CacheEngine's self.num_layers from num_attention_layers to num_attention_layers//sharing_factor: https://github.com/vllm-project/vllm/blob/f7160d946a0a07703e72d81ba9ecf3913f192605/vllm/worker/cache_engine.py#L37-L38

  3. When using kv_cache within the model, correctly handle the index of kv_cache passed to each layer.

  4. Allow the input key and value for attention to be None. In that case, the operation that writes key and value into the KV cache would simply be skipped, e.g. here: https://github.com/vllm-project/vllm/blob/f7160d946a0a07703e72d81ba9ecf3913f192605/vllm/attention/backends/flash_attn.py#L295-L302

  5. There's no need to modify attn_metadata, because the content of attn_metadata is exactly the same for each layer.
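
To make points 1–3 concrete, here is a minimal standalone sketch of the bookkeeping I have in mind (not actual vLLM code; `sharing_factor` is assumed to come from the model config, and all helper names are made up):

```python
def get_num_kv_cache_layers(num_attention_layers: int, sharing_factor: int) -> int:
    """Point 2: only one KV cache is allocated per group of `sharing_factor` layers."""
    assert num_attention_layers % sharing_factor == 0
    return num_attention_layers // sharing_factor


def get_cache_block_size_with_cla(block_size: int, num_kv_heads: int, head_size: int,
                                  num_attention_layers: int, sharing_factor: int,
                                  dtype_size: int = 2) -> int:
    """Point 1: scale the per-block cache size by the number of layers that
    actually own a KV cache (the leading 2 accounts for key + value)."""
    num_kv_layers = get_num_kv_cache_layers(num_attention_layers, sharing_factor)
    return 2 * block_size * num_kv_heads * head_size * num_kv_layers * dtype_size


def kv_cache_index(layer_idx: int, sharing_factor: int) -> int:
    """Point 3: map every attention layer to the cache owned by its group."""
    return layer_idx // sharing_factor


# Example: 32 layers with sharing_factor=2 need only 16 KV caches,
# and layers 0 and 1 both read from cache 0.
assert get_num_kv_cache_layers(32, 2) == 16
assert kv_cache_index(0, 2) == kv_cache_index(1, 2) == 0
```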

JiayiFeng avatar Jul 12 '24 09:07 JiayiFeng

You are correct. The change should be small: we just need to enable cross-layer sharing and prevent the sharing layers from writing to the cache. However, I would like to hold off on this until there is an open-source model (or fine-tunes that support this), so we can actually verify the output.
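
Concretely, 'prevent writing to the cache in later layers' (point 4 above) could be as simple as a guard like the following toy sketch; the scatter below is a stand-in for the backend's real cache-write op, not vLLM code:

```python
from typing import Optional

import torch


def maybe_write_kv_cache(key: Optional[torch.Tensor],
                         value: Optional[torch.Tensor],
                         key_cache: torch.Tensor,
                         value_cache: torch.Tensor,
                         slot_mapping: torch.Tensor) -> None:
    """Skip the KV-cache write for layers that reuse another layer's K/V.

    A sharing layer passes key=value=None, so nothing is written; only the
    producer layer of each group fills its slots.
    """
    if key is None or value is None:
        return  # sharing layer: its K/V already lives in the producer's cache
    key_cache.index_copy_(0, slot_mapping, key)
    value_cache.index_copy_(0, slot_mapping, value)


# Example: the producer layer writes; the sharing layer's call is a no-op.
num_slots, num_kv_heads, head_size = 16, 4, 8
key_cache = torch.zeros(num_slots, num_kv_heads, head_size)
value_cache = torch.zeros(num_slots, num_kv_heads, head_size)
slots = torch.tensor([0, 1, 2])
k = torch.randn(3, num_kv_heads, head_size)
v = torch.randn(3, num_kv_heads, head_size)
maybe_write_kv_cache(k, v, key_cache, value_cache, slots)        # producer layer
maybe_write_kv_cache(None, None, key_cache, value_cache, slots)  # sharing layer
```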

simon-mo avatar Jul 13 '24 05:07 simon-mo

@simon-mo @JiayiFeng Hi! Transformers with CLA-like attention have gained some popularity in the community; see https://arxiv.org/abs/2405.05254. I hope vLLM can support it (i.e., enable sharing the KV cache across layers for all backends).

RunningLeon avatar Sep 13 '24 12:09 RunningLeon

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Dec 13 '24 02:12 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Jan 13 '25 02:01 github-actions[bot]