FasterTransformer
Are MQA and GQA in development?
Hi Experts,
Recently, some emerging models use MQA (Multi-Query Attention) or GQA (Grouped-Query Attention). From the issues list, I noticed that several users have already asked about support for these two algorithms, and it has been quite a while. Is there any plan to support them, and when will the code be merged?
Models currently using MQA or GQA:
- Llama2 (GQA)
- ChatGLM2-6B
- Falcon
- SantaCoder, StarCoder
Any comments will be appreciated.
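For context, here is a minimal NumPy sketch of what GQA does: groups of query heads share a single key/value head, with MQA as the special case of one KV head. Shapes and names are illustrative only, not FasterTransformer's API.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch (illustrative, not a real kernel).

    q: (num_q_heads, seq, d) -- query heads
    k, v: (num_kv_heads, seq, d) -- fewer KV heads than query heads;
    each group of num_q_heads // num_kv_heads query heads shares one KV head.
    """
    num_q_heads, _, d = q.shape
    num_kv_heads = k.shape[0]
    group_size = num_q_heads // num_kv_heads
    # Repeat each KV head so it lines up with its group of query heads
    k_rep = np.repeat(k, group_size, axis=0)  # (num_q_heads, seq, d)
    v_rep = np.repeat(v, group_size, axis=0)
    # Standard scaled dot-product attention per head
    scores = q @ k_rep.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_rep  # (num_q_heads, seq, d)
```

With `num_kv_heads == 1` this is MQA; with `num_kv_heads == num_q_heads` it reduces to ordinary multi-head attention. The practical benefit is a KV cache that is `num_q_heads / num_kv_heads` times smaller during inference.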
GQA is already supported by LMDeploy, which is built on FasterTransformer.
mark +1
llama 2 is all GQA?
Only the 70B model is GQA.
34B and 70B
Got it
FasterTransformer development has transitioned to TensorRT-LLM.
MQA and GQA are supported in TensorRT-LLM. Please give it a try.
