FasterTransformer

Are MQA and GQA in development?

Open ljayx opened this issue 2 years ago • 8 comments

Hi Experts,

Recently, some emerging models use MQA (Multi-Query Attention) or GQA (Grouped-Query Attention). From the issues list, I noticed that some users asked about support for these two algorithms quite a while ago. Is there any plan to support them, and when will the code be merged?

Currently using MQA, GQA for modeling:

  • Llama2 (GQA)
  • ChatGLM2-6B
  • Falcon
  • SantaCoder, StarCoder

Any comments will be appreciated.

ljayx avatar Jul 20 '23 07:07 ljayx
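For readers unfamiliar with the two variants mentioned above: in GQA, groups of query heads share a single key/value head, and MQA is the extreme case where all query heads share one KV head (standard MHA is the other extreme, one KV head per query head). Below is a minimal NumPy sketch of this grouping; the shapes and function name are illustrative only, not FasterTransformer's API.

```python
import numpy as np

def grouped_query_attention(q, k, v, num_kv_heads):
    """Illustrative GQA sketch (not FasterTransformer's implementation).

    q: (H, T, d) with H query heads; k, v: (G, T, d) with G <= H KV heads.
    Each group of H // G query heads shares one KV head.
    G == 1 is MQA; G == H is ordinary multi-head attention.
    """
    H, _, d = q.shape
    G = num_kv_heads
    assert H % G == 0, "query heads must divide evenly into KV groups"
    rep = H // G
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, rep, axis=0)                          # (H, T, d)
    v = np.repeat(v, rep, axis=0)                          # (H, T, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)         # (H, T, T)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)              # row-wise softmax
    return weights @ v                                     # (H, T, d)
```

The practical appeal is the KV cache: with G KV heads instead of H, the per-token cache shrinks by a factor of H / G, which is why these variants matter for inference engines.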

GQA has been supported by LMDeploy, which is developed based on FasterTransformer

lvhan028 avatar Jul 25 '23 03:07 lvhan028

mark +1

datalee avatar Aug 07 '23 00:08 datalee

Is Llama 2 all GQA?

bigmover avatar Aug 09 '23 12:08 bigmover

Only the 70B model is GQA.

lvhan028 avatar Aug 09 '23 13:08 lvhan028
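One way to see which variant a checkpoint uses is to compare its query-head and KV-head counts. The sketch below classifies a model from those two numbers; the parameter names mirror the `num_attention_heads` / `num_key_value_heads` fields found in Hugging Face-style `config.json` files for Llama-family models, but treat the helper itself as a hypothetical illustration.

```python
def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify attention by the ratio of query heads to KV heads."""
    if num_key_value_heads == 1:
        return "MQA"  # all query heads share a single KV head
    if num_key_value_heads == num_attention_heads:
        return "MHA"  # classic multi-head attention, one KV head per query head
    return "GQA"      # query heads grouped over fewer KV heads

# e.g. Llama-2-70B ships with 64 attention heads and 8 KV heads -> "GQA",
# while the 7B/13B configs have equal head counts -> "MHA".
```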

34B and 70B [image]

ljayx avatar Aug 10 '23 03:08 ljayx

Only the 70B model is GQA.

Got it

bigmover avatar Aug 11 '23 08:08 bigmover

FasterTransformer development has transitioned to TensorRT-LLM.

MQA and GQA are supported in TensorRT-LLM. Please give it a try.

byshiue avatar Oct 20 '23 07:10 byshiue