How does this compare to MQA (multi-query attention)?
https://arxiv.org/abs/1911.02150
For example, StarCoder uses MQA to speed up inference. How does PagedAttention compare to Multi-Query Attention? Are they compatible?
Thanks for your interest! PagedAttention operates at the level of KV-cache memory management rather than the attention formulation itself, so it is orthogonal to MQA: the two can be combined, and PagedAttention still eliminates most of the KV-cache memory waste for MQA models. We are indeed planning to add MQA-based models such as StarCoder and Falcon. Please stay tuned.
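To make the distinction concrete, here is a minimal NumPy sketch of MQA (shapes and names are illustrative, not vLLM's implementation): all query heads attend over a single shared key/value head, so the KV cache shrinks by a factor of `n_heads` regardless of how that cache is laid out in memory (contiguously, or in pages as PagedAttention does).

```python
import numpy as np

def multi_query_attention(q, k, v):
    """Illustrative multi-query attention.

    q: (n_heads, seq_len, d_head) -- one query projection per head
    k: (seq_len, d_head)          -- ONE key head shared by all query heads
    v: (seq_len, d_head)          -- ONE value head shared by all query heads

    In standard multi-head attention, k and v would each be
    (n_heads, seq_len, d_head); MQA stores n_heads times less KV cache.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (n_heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (n_heads, seq, d_head)

out = multi_query_attention(
    np.random.randn(8, 4, 16),  # 8 query heads, 4 tokens, head dim 16
    np.random.randn(4, 16),     # single shared K head
    np.random.randn(4, 16),     # single shared V head
)
```

The point is that MQA changes *what* is cached (one K/V head instead of many), while PagedAttention changes *how* the cache is stored, so the savings compose.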
Really looking forward to getting StarCoder support!
@xpl vLLM now supports StarCoder thanks to @michaelfeil. Please try it out!
I'll keep this issue open, though, as we don't have an efficient implementation of MQA yet.
Fixed by #452