FasterTransformer

Are MQA and GQA in development?

Open ljayx opened this issue 2 years ago • 8 comments

Hi Experts,

Recently, some emerging models use MQA (Multi-Query Attention) or GQA (Grouped-Query Attention). From the issues list, I noticed that some users asked about support for these two algorithms quite a while ago. Is there any plan to support them, and when will the code be merged?

Currently using MQA, GQA for modeling:

  • Llama2 (GQA)
  • ChatGLM2-6B
  • Falcon
  • SantaCoder, StarCoder

Any comments will be appreciated.

ljayx avatar Jul 20 '23 07:07 ljayx
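For readers unfamiliar with the two variants mentioned above: in GQA, groups of query heads share a single key/value head, and MQA is the extreme case where all query heads share one KV head (standard MHA is the other extreme, one KV head per query head). Below is a minimal NumPy sketch of this grouping; the shapes and function name are illustrative only, not FasterTransformer's API.

```python
import numpy as np

def grouped_query_attention(q, k, v, num_kv_heads):
    """Illustrative GQA sketch (not FasterTransformer's implementation).

    q: (H, T, d) with H query heads; k, v: (G, T, d) with G <= H KV heads.
    Each group of H // G query heads shares one KV head.
    G == 1 is MQA; G == H is ordinary multi-head attention.
    """
    H, _, d = q.shape
    G = num_kv_heads
    assert H % G == 0, "query heads must divide evenly into KV groups"
    rep = H // G
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, rep, axis=0)                          # (H, T, d)
    v = np.repeat(v, rep, axis=0)                          # (H, T, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)         # (H, T, T)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)              # row-wise softmax
    return weights @ v                                     # (H, T, d)
```

The practical appeal is the KV cache: with G KV heads instead of H, the per-token cache shrinks by a factor of H / G, which is why these variants matter for inference engines.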

GQA has been supported by LMDeploy, which is developed based on FasterTransformer

lvhan028 avatar Jul 25 '23 03:07 lvhan028

mark +1

datalee avatar Aug 07 '23 00:08 datalee

Is Llama 2 all GQA?

bigmover avatar Aug 09 '23 12:08 bigmover

Only the 70B model is GQA.

lvhan028 avatar Aug 09 '23 13:08 lvhan028
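One way to see which variant a checkpoint uses is to compare its query-head and KV-head counts. The sketch below classifies a model from those two numbers; the parameter names mirror the `num_attention_heads` / `num_key_value_heads` fields found in Hugging Face-style `config.json` files for Llama-family models, but treat the helper itself as a hypothetical illustration.

```python
def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify attention by the ratio of query heads to KV heads."""
    if num_key_value_heads == 1:
        return "MQA"  # all query heads share a single KV head
    if num_key_value_heads == num_attention_heads:
        return "MHA"  # classic multi-head attention, one KV head per query head
    return "GQA"      # query heads grouped over fewer KV heads

# e.g. Llama-2-70B ships with 64 attention heads and 8 KV heads -> "GQA",
# while the 7B/13B configs have equal head counts -> "MHA".
```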

34B and 70B [image]

ljayx avatar Aug 10 '23 03:08 ljayx

Only the 70B model is GQA.

Got it

bigmover avatar Aug 11 '23 08:08 bigmover

FasterTransformer development has transitioned to TensorRT-LLM.

MQA and GQA are supported in TensorRT-LLM. Please give it a try.

byshiue avatar Oct 20 '23 07:10 byshiue