CTranslate2 Quantization AWQ GEMM + GEMV

This adds support for 4-bit quantization with AWQ. Two stable versions are available: GEMM and GEMV.

Currently, I have only added AWQ to the Llama and Mistral converters. Other models could be added easily if they need AWQ quantization.

I ran some benchmarks with it:

With batch_size = 1 only, model Mistral 7B:

Quant type	Speed (tok/s)	VRAM
int8	86.4	7722 MiB
awq gemm	73	4746 MiB
awq gemv	127	4746 MiB

Jun 19 '24 12:06 minhthuc2502
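
To put the table above in relative terms, here is a small Python sketch that computes each mode's speed ratio and VRAM saving against int8. The figures are copied from the post; the `compare` helper and the dictionary layout are illustrative names, not part of CTranslate2.

```python
# Figures copied from the benchmark table (Mistral 7B, batch_size = 1).
results = {
    "int8": {"tok_per_s": 86.4, "vram_mib": 7722},
    "awq gemm": {"tok_per_s": 73.0, "vram_mib": 4746},
    "awq gemv": {"tok_per_s": 127.0, "vram_mib": 4746},
}


def compare(results, baseline="int8"):
    """Return speed ratio and fractional VRAM saving relative to `baseline`."""
    base = results[baseline]
    summary = {}
    for name, r in results.items():
        summary[name] = {
            # > 1.0 means faster than the baseline
            "speed_ratio": r["tok_per_s"] / base["tok_per_s"],
            # fraction of the baseline's VRAM that is no longer needed
            "vram_saved": 1 - r["vram_mib"] / base["vram_mib"],
        }
    return summary


summary = compare(results)
for name, s in summary.items():
    print(f"{name}: {s['speed_ratio']:.2f}x speed, {s['vram_saved']:.1%} less VRAM than int8")
```

On these numbers, GEMV runs at about 1.47x the int8 speed and GEMM at about 0.84x, and both use roughly 38.5% less VRAM. This is a single-batch measurement, so the ratios may not carry over to larger batch sizes.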

Wow, that is spooky... I just started benchmarking AWQ last night for the first time. Do you think you'd eventually want to incorporate the "exllama" option as well? For more info, see here:

https://github.com/huggingface/transformers/blob/547b5582ec85147492f2485dd8e9cbbeb1016fd8/src/transformers/utils/quantization_config.py#L47

Also, would you mind sharing the script you used to benchmark or perhaps just some snippets? I wouldn't mind downloading a development branch and trying my hand at it.

Jun 19 '24 12:06 BBC-Esq

Currently, I only support GEMM and GEMV, which are the most used versions. It would be nice to support all of them in the future.

I only ran the benchmarks in C++, so I think you would have to build this project in C++ first. BTW, the code I used to benchmark is quite dirty, but I will try to improve it and add it to the repo.

Jun 19 '24 14:06 minhthuc2502

Thanks, I'm still learning to "build" anything (unsuccessfully as of yet...) believe it or not, but if you upload it I'll take a look.

Jun 19 '24 14:06 BBC-Esq

Can you share the code you used to benchmark?

Sep 09 '24 09:09 BBC-Esq

I used this. You can tweak it a bit to create the correct prompt for the model used.

Sep 10 '24 10:09 minhthuc2502
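
The script linked in the reply above is not reproduced in this thread, and the original benchmarks were run in C++. For anyone who wants a rough tok/s number from Python instead, the following is a minimal sketch, not the author's code. `measure_throughput`, `make_ct2_generate`, and `fake_generate` are hypothetical helper names, and the model directory passed to the wrapper is assumed to be an already converted CTranslate2 model. The wrapper relies on `ctranslate2.Generator.generate_batch` with greedy decoding; it is only imported when called, so the timing helper can be tried without a GPU.

```python
import time


def measure_throughput(generate, prompt_tokens, max_new_tokens, runs=3, warmup=1):
    """Return average generated tokens per second over `runs` timed calls.

    `generate` is any callable taking (prompt_tokens, max_new_tokens) and
    returning the list of newly generated tokens.
    """
    for _ in range(warmup):
        # Untimed calls so one-off setup costs do not skew the result.
        generate(prompt_tokens, max_new_tokens)

    total_tokens = 0
    total_time = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        output = generate(prompt_tokens, max_new_tokens)
        total_time += time.perf_counter() - start
        total_tokens += len(output)
    return total_tokens / total_time


def make_ct2_generate(model_dir, device="cuda"):
    """Wrap a converted CTranslate2 model as a `generate` callable.

    Assumes the `ctranslate2` package is installed and `model_dir` points to
    a converted model; both are assumptions of this sketch.
    """
    import ctranslate2

    generator = ctranslate2.Generator(model_dir, device=device)

    def generate(prompt_tokens, max_new_tokens):
        results = generator.generate_batch(
            [prompt_tokens],                 # batch of one prompt (batch_size = 1)
            max_length=max_new_tokens,       # cap on the generated length
            sampling_topk=1,                 # greedy decoding
            include_prompt_in_result=False,  # count only newly generated tokens
        )
        return results[0].sequences[0]

    return generate


def fake_generate(prompt_tokens, max_new_tokens):
    """Stand-in generator used to exercise the timing code without a model."""
    time.sleep(0.01)
    return ["tok"] * max_new_tokens


print(f"stub throughput: {measure_throughput(fake_generate, ['Hello'], 16):.0f} tok/s")
```

To benchmark a real model, the stub would be swapped for `make_ct2_generate(...)` and the prompt tokenized with the model's own tokenizer and prompt template, as the reply above suggests. Numbers from this Python loop are not directly comparable to the C++ figures in the first post.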
