
[Feature]: Is there any plan to support Cross-Layer Attention (CLA)?

Open JiayiFeng opened this issue 1 year ago • 4 comments

Cross-Layer Attention (CLA), recently proposed by MIT, can significantly reduce runtime KV cache memory usage. Does vLLM have any plans to support it? Thanks!

Cross-Layer Attention paper: https://arxiv.org/abs/2405.12981
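
For readers skimming the paper, here is a toy PyTorch sketch of the core CLA idea (this is not vLLM code, and every name in it is made up for illustration): only every `sharing_factor`-th layer computes and caches its own K/V, and the layers in between reuse the most recent producer layer's K/V, which is what shrinks the KV cache.

```python
import torch
import torch.nn as nn

class CLABlock(nn.Module):
    """Toy single-head decoder block (no causal mask, for brevity).
    'Producer' layers compute K/V; the other layers reuse the K/V
    tensors from the most recent producer layer."""

    def __init__(self, hidden_size: int, is_kv_producer: bool):
        super().__init__()
        self.is_kv_producer = is_kv_producer
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        if is_kv_producer:
            self.k_proj = nn.Linear(hidden_size, hidden_size)
            self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.o_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, shared_kv=None):
        q = self.q_proj(x)
        if self.is_kv_producer:
            k, v = self.k_proj(x), self.v_proj(x)
        else:
            k, v = shared_kv  # reuse K/V produced by an earlier layer
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return self.o_proj(attn @ v), (k, v)

# With sharing_factor=2, layers 0 and 2 produce K/V; layers 1 and 3 reuse it,
# so only half of the layers need their own KV cache at inference time.
sharing_factor = 2
layers = [CLABlock(64, is_kv_producer=(i % sharing_factor == 0)) for i in range(4)]
x, kv = torch.randn(1, 8, 64), None
for layer in layers:
    x, kv = layer(x, shared_kv=kv)
```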

JiayiFeng avatar Jul 10 '24 13:07 JiayiFeng

This method is interesting, and I believe it is quite effective overall (also see https://research.character.ai/optimizing-inference/).

However, it currently requires a model trained for this technique. We would love to see such a model, and then it should be straightforward to add!

simon-mo avatar Jul 11 '24 14:07 simon-mo

@simon-mo Thank you for your reply! It seems that implementing CLA only requires a few modifications to the current vLLM code. I've listed them below, with a rough sketch after the list. I'm not sure whether this will work; is there anything I missed? Thanks!

  1. When calculating cache_block_size, take sharing_factor into consideration: https://github.com/vllm-project/vllm/blob/f7160d946a0a07703e72d81ba9ecf3913f192605/vllm/worker/cache_engine.py#L117

  2. Change CacheEngine's self.num_layers from num_attention_layers to num_attention_layers//sharing_factor: https://github.com/vllm-project/vllm/blob/f7160d946a0a07703e72d81ba9ecf3913f192605/vllm/worker/cache_engine.py#L37-L38

  3. When using kv_cache within the model, correctly handle the index of kv_cache passed to each layer.

  4. Allow the input key and value for attention to be None. In that case, the operation that writes key and value into the KV cache would simply be skipped, e.g. here: https://github.com/vllm-project/vllm/blob/f7160d946a0a07703e72d81ba9ecf3913f192605/vllm/attention/backends/flash_attn.py#L295-L302

  5. There's no need to modify attn_metadata, because the content of attn_metadata is exactly the same for each layer.
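
To make points 1–3 concrete, here is a minimal standalone sketch of the bookkeeping I have in mind (not actual vLLM code; `sharing_factor` is assumed to come from the model config, and all helper names are made up):

```python
def get_num_kv_cache_layers(num_attention_layers: int, sharing_factor: int) -> int:
    """Point 2: only one KV cache is allocated per group of `sharing_factor` layers."""
    assert num_attention_layers % sharing_factor == 0
    return num_attention_layers // sharing_factor


def get_cache_block_size_with_cla(block_size: int, num_kv_heads: int, head_size: int,
                                  num_attention_layers: int, sharing_factor: int,
                                  dtype_size: int = 2) -> int:
    """Point 1: scale the per-block cache size by the number of layers that
    actually own a KV cache (the leading 2 accounts for key + value)."""
    num_kv_layers = get_num_kv_cache_layers(num_attention_layers, sharing_factor)
    return 2 * block_size * num_kv_heads * head_size * num_kv_layers * dtype_size


def kv_cache_index(layer_idx: int, sharing_factor: int) -> int:
    """Point 3: map every attention layer to the cache owned by its group."""
    return layer_idx // sharing_factor


# Example: 32 layers with sharing_factor=2 need only 16 KV caches,
# and layers 0 and 1 both read from cache 0.
assert get_num_kv_cache_layers(32, 2) == 16
assert kv_cache_index(0, 2) == kv_cache_index(1, 2) == 0
```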

JiayiFeng avatar Jul 12 '24 09:07 JiayiFeng

You are correct. The change should be small: we just need to enable cross-layer sharing and prevent the sharing layers from writing to the cache. However, I would like to hold off on this until there is an open-source model (or fine-tunes that support this), so we can actually verify the output.
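
Concretely, 'prevent writing to the cache in later layers' (point 4 above) could be as simple as a guard like the following toy sketch; the scatter below is a stand-in for the backend's real cache-write op, not vLLM code:

```python
from typing import Optional

import torch


def maybe_write_kv_cache(key: Optional[torch.Tensor],
                         value: Optional[torch.Tensor],
                         key_cache: torch.Tensor,
                         value_cache: torch.Tensor,
                         slot_mapping: torch.Tensor) -> None:
    """Skip the KV-cache write for layers that reuse another layer's K/V.

    A sharing layer passes key=value=None, so nothing is written; only the
    producer layer of each group fills its slots.
    """
    if key is None or value is None:
        return  # sharing layer: its K/V already lives in the producer's cache
    key_cache.index_copy_(0, slot_mapping, key)
    value_cache.index_copy_(0, slot_mapping, value)


# Example: the producer layer writes; the sharing layer's call is a no-op.
num_slots, num_kv_heads, head_size = 16, 4, 8
key_cache = torch.zeros(num_slots, num_kv_heads, head_size)
value_cache = torch.zeros(num_slots, num_kv_heads, head_size)
slots = torch.tensor([0, 1, 2])
k = torch.randn(3, num_kv_heads, head_size)
v = torch.randn(3, num_kv_heads, head_size)
maybe_write_kv_cache(k, v, key_cache, value_cache, slots)        # producer layer
maybe_write_kv_cache(None, None, key_cache, value_cache, slots)  # sharing layer
```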

simon-mo avatar Jul 13 '24 05:07 simon-mo

@simon-mo @JiayiFeng Hi! Transformers with CLA-like attention have gained some popularity in the community; see https://arxiv.org/abs/2405.05254. I hope vLLM can support it (i.e., enable sharing the KV cache across layers for all backends).

RunningLeon avatar Sep 13 '24 12:09 RunningLeon

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Dec 13 '24 02:12 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Jan 13 '25 02:01 github-actions[bot]