[Feature]: Is there any plan to support Cross-Layer Attention (CLA)?
Cross-Layer Attention (CLA), recently proposed by MIT, can significantly reduce runtime memory usage. Does vLLM have any plans to support it? Thanks!
Cross-Layer Attention paper: https://arxiv.org/abs/2405.12981
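For context, here is a minimal toy sketch of the idea as I understand it from the paper (illustrative only, not vLLM code: single head, no RoPE, no real KV cache object, sharing factor of 2). The point is that only every other layer computes K/V, and the layer in between reuses the previous layer's K/V, so only half the K/V would ever need to be cached.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLADecoder(nn.Module):
    """Toy decoder: only every `sharing_factor`-th layer computes K/V;
    the layers in between reuse the most recent K/V."""

    def __init__(self, num_layers: int, hidden: int, sharing_factor: int = 2):
        super().__init__()
        self.sharing_factor = sharing_factor
        self.q_projs = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_layers))
        self.o_projs = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_layers))
        # K/V projections exist only for the "owner" layers (0, sharing_factor, ...).
        self.kv_projs = nn.ModuleList(
            nn.Linear(hidden, 2 * hidden) for _ in range(0, num_layers, sharing_factor)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        k = v = None
        for i in range(len(self.q_projs)):
            q = self.q_projs[i](x)
            if i % self.sharing_factor == 0:
                # Owner layer: compute fresh K/V (this is what would be cached).
                k, v = self.kv_projs[i // self.sharing_factor](x).chunk(2, dim=-1)
            # Follower layers have no K/V projection and reuse k, v as-is.
            attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            x = x + self.o_projs[i](attn)
        return x

# Toy usage: 4 layers but only 2 distinct K/V tensors per forward pass.
y = CLADecoder(num_layers=4, hidden=64)(torch.randn(2, 16, 64))
```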
This method is interesting and I believe pretty effective overall (also see https://research.character.ai/optimizing-inference/)
However, it currently seems to require a model trained for this technique. We would love to see such a model released, and it should be straightforward to add!
@simon-mo Thank you for your reply! It seems that to implement CLA, we only need to make a few modifications to the current vLLM code. I've listed them below, with a rough sketch of the cache-side changes after the list. I'm not sure whether my idea will work; is there anything I missed? Thanks!
- When calculating `cache_block_size`, take `sharing_factor` into consideration: https://github.com/vllm-project/vllm/blob/f7160d946a0a07703e72d81ba9ecf3913f192605/vllm/worker/cache_engine.py#L117
- Change CacheEngine's `self.num_layers` from `num_attention_layers` to `num_attention_layers // sharing_factor`: https://github.com/vllm-project/vllm/blob/f7160d946a0a07703e72d81ba9ecf3913f192605/vllm/worker/cache_engine.py#L37-L38
- When using `kv_cache` within the model, correctly handle the index of the `kv_cache` passed to each layer.
- Allow the input `key` and `value` for attention to be `None`. In this case, the operation of writing `key` and `value` into the KV cache will no longer be executed, like here: https://github.com/vllm-project/vllm/blob/f7160d946a0a07703e72d81ba9ecf3913f192605/vllm/attention/backends/flash_attn.py#L295-L302
- There's no need to modify `attn_metadata`, because the content of `attn_metadata` is exactly the same for each layer.
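For the cache-accounting pieces, here is a rough standalone sketch of what I have in mind (illustrative only, not actual vLLM code: `sharing_factor`, the function names, and the fp16 `dtype_size=2` default are all assumptions; the real changes would live in the linked `cache_engine.py` logic):

```python
def num_cached_layers(num_attention_layers: int, sharing_factor: int) -> int:
    """With CLA, only one layer in every `sharing_factor` layers owns a KV cache."""
    return num_attention_layers // sharing_factor

def cache_block_size_bytes(block_size: int, num_kv_heads: int, head_size: int,
                           num_attention_layers: int, sharing_factor: int,
                           dtype_size: int = 2) -> int:
    """Per-block KV cache size in bytes; dividing the layer count by
    `sharing_factor` shrinks the total proportionally (dtype_size=2 assumes fp16)."""
    key_block = block_size * num_kv_heads * head_size
    value_block = key_block
    layers = num_cached_layers(num_attention_layers, sharing_factor)
    return dtype_size * layers * (key_block + value_block)

def kv_cache_index(layer_idx: int, sharing_factor: int) -> int:
    """Map a decoder layer index to the KV cache it should read, so all layers
    in the same sharing group see the same cache tensor."""
    return layer_idx // sharing_factor

def owns_kv_cache(layer_idx: int, sharing_factor: int) -> bool:
    """Only the first layer of each group writes K/V; the others pass
    key=None / value=None so the backend skips the cache write."""
    return layer_idx % sharing_factor == 0

# Example: 32 layers, sharing_factor=2 -> 16 caches; layers (0,1) share cache 0,
# (2,3) share cache 1, ..., and only even-indexed layers write.
assert num_cached_layers(32, 2) == 16
assert kv_cache_index(3, 2) == 1 and not owns_kv_cache(3, 2)
```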
You are correct. The change will be small because we just need to enable cross-layer sharing and prevent writing to the cache in later layers. However, I would like to hold off on this until there is an open-source model (or fine-tunes that support this), so we can actually test the output correctly.
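Concretely, the write guard could look something like this minimal sketch (illustrative only, not the actual FlashAttention backend; the flat `[2, batch, seq, hidden]` cache layout is made up for the toy): owner layers pass real `key`/`value` and populate the cache, follower layers pass `None` and only read.

```python
from typing import Optional

import torch
import torch.nn.functional as F

def attention_with_shared_cache(query: torch.Tensor,
                                key: Optional[torch.Tensor],
                                value: Optional[torch.Tensor],
                                kv_cache: torch.Tensor) -> torch.Tensor:
    # kv_cache: [2, batch, seq, hidden]; index 0 holds keys, index 1 holds values.
    if key is not None and value is not None:
        # Owner layer: write K/V into the shared cache.
        kv_cache[0].copy_(key)
        kv_cache[1].copy_(value)
    # Both owner and follower layers attend against the cached K/V.
    return F.scaled_dot_product_attention(query, kv_cache[0], kv_cache[1], is_causal=True)

# Layer 0 (owner) fills the cache; layer 1 (follower) passes None and reuses it.
batch, seq, hidden = 1, 8, 16
cache = torch.zeros(2, batch, seq, hidden)
out0 = attention_with_shared_cache(torch.randn(batch, seq, hidden),
                                   torch.randn(batch, seq, hidden),
                                   torch.randn(batch, seq, hidden), cache)
out1 = attention_with_shared_cache(torch.randn(batch, seq, hidden), None, None, cache)
```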
@simon-mo @JiayiFeng Hi, guys. Transformers with CLA-like attention have gained some popularity in the community. See https://arxiv.org/abs/2405.05254. Hope vLLM can support it (enabling KV cache sharing for all backends).
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!