
How does this compare to MQA (multi-query attention)?

Open xpl opened this issue 2 years ago • 4 comments

https://arxiv.org/abs/1911.02150

For example, StarCoder uses MQA to speed up inference. How does PagedAttention compare to Multi-Query Attention? Are they compatible?

xpl avatar Jun 20 '23 21:06 xpl

Thanks for your interest! PagedAttention is less a new attention algorithm than a way of implementing attention: it changes how the KV cache is stored in memory, not what is computed. It is therefore also applicable to MQA and can eliminate much of the wasted memory there as well. We are indeed planning to add MQA-based models such as StarCoder and Falcon. Please stay tuned.

WoosukKwon avatar Jun 20 '23 21:06 WoosukKwon
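To make the point in the reply above concrete, here is a rough back-of-the-envelope sketch of why the two techniques are complementary: MQA shrinks the KV cache per token, while paging shrinks the number of token slots reserved but never used. The model shape, the `max_len` of 2048, and the block size of 16 are illustrative assumptions, and the helper names are hypothetical; none of this is vLLM's actual code.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (the factor 2 covers K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def contiguous_reserved_tokens(seq_len, max_len):
    """Slots reserved when a full max_len buffer is preallocated per sequence."""
    return max(seq_len, max_len)


def paged_reserved_tokens(seq_len, block_size=16):
    """Slots reserved when the cache grows in fixed-size blocks on demand."""
    return math.ceil(seq_len / block_size) * block_size


# Illustrative (assumed) model shape: 40 layers, 48 query heads, head_dim 128, fp16.
layers, heads, head_dim = 40, 48, 128

mha_bytes = kv_bytes_per_token(layers, heads, head_dim)  # one K/V pair per head
mqa_bytes = kv_bytes_per_token(layers, 1, head_dim)      # one K/V pair shared by all heads

seq_len, max_len = 100, 2048
contiguous = contiguous_reserved_tokens(seq_len, max_len)
paged = paged_reserved_tokens(seq_len)

for name, per_token in (("MHA", mha_bytes), ("MQA", mqa_bytes)):
    print(
        f"{name}: {per_token} B/token, "
        f"contiguous waste {(contiguous - seq_len) * per_token} B, "
        f"paged waste {(paged - seq_len) * per_token} B"
    )
```

Under these assumptions MQA cuts the per-token cache by a factor equal to the number of query heads (48 here), and paging cuts the unused slots for a 100-token sequence from 1948 to 12. The two savings multiply, which is why applying PagedAttention to an MQA model still pays off.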

Really looking forward to getting StarCoder support!

xpl avatar Jun 21 '23 13:06 xpl

@xpl vLLM now supports StarCoder thanks to @michaelfeil. Please try it out!

WoosukKwon avatar Jun 22 '23 18:06 WoosukKwon

I'll keep this issue open, though, as we don't yet have an "efficient" implementation of MQA.

WoosukKwon avatar Jun 22 '23 18:06 WoosukKwon

Fixed by #452

zhuohan123 avatar Jul 16 '23 21:07 zhuohan123