Quantization AWQ GEMM + GEMV

Open minhthuc2502 opened this issue 1 year ago • 3 comments

This adds support for 4-bit quantization with AWQ. Two stable versions are available: GEMM and GEMV.

Currently, I have only added AWQ to the Llama and Mistral converters. Other models could be added easily if they need AWQ quantization.

I ran some benchmarks with it:

With batch_size = 1 only, on Mistral 7B:

| Quant type | Speed (tok/s) | VRAM (MiB) |
|------------|---------------|------------|
| int8       | 86.4          | 7722       |
| awq gemm   | 73            | 4746       |
| awq gemv   | 127           | 4746       |
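
To put these figures in relative terms, the short sketch below computes each variant's speed ratio and VRAM saving against the int8 baseline. It uses only the numbers reported above; the function and key names are illustrative and not part of CTranslate2.

```python
# Figures copied from the benchmark above (Mistral 7B, batch_size = 1).
results = {
    "int8": {"tok_per_s": 86.4, "vram_mib": 7722},
    "awq gemm": {"tok_per_s": 73.0, "vram_mib": 4746},
    "awq gemv": {"tok_per_s": 127.0, "vram_mib": 4746},
}


def compare_to_baseline(results, baseline="int8"):
    """Return the speed ratio and VRAM saving of each entry relative to `baseline`."""
    base = results[baseline]
    summary = {}
    for name, r in results.items():
        summary[name] = {
            # > 1.0 means faster than the baseline.
            "speed_ratio": r["tok_per_s"] / base["tok_per_s"],
            # Percentage of the baseline's VRAM that is no longer needed.
            "vram_saved_pct": 100.0 * (1 - r["vram_mib"] / base["vram_mib"]),
        }
    return summary


summary = compare_to_baseline(results)
for name, s in summary.items():
    print(f"{name:9s} speed x{s['speed_ratio']:.2f}  VRAM saved {s['vram_saved_pct']:.1f}%")
```

On these numbers, AWQ GEMV comes out at roughly 1.47x the int8 speed and AWQ GEMM at roughly 0.84x, with both AWQ variants using about 38.5% less VRAM.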

minhthuc2502 avatar Jun 19 '24 12:06 minhthuc2502

Wow, that is spooky... I just started benchmarking AWQ last night for the first time. Do you think that eventually you'd want to incorporate the "exllama" option as well? For more info see here:

https://github.com/huggingface/transformers/blob/547b5582ec85147492f2485dd8e9cbbeb1016fd8/src/transformers/utils/quantization_config.py#L47

Also, would you mind sharing the script you used to benchmark or perhaps just some snippets? I wouldn't mind downloading a development branch and trying my hand at it.

BBC-Esq avatar Jun 19 '24 12:06 BBC-Esq

Currently, I only support GEMM and GEMV, which are the most used versions. It would be nice to support all of them in the future.

I ran the benchmarks in C++ only, so I think you would have to build this project in C++ first. BTW, the code I used for benchmarking is quite messy, but I will try to clean it up and add it to the repo.

minhthuc2502 avatar Jun 19 '24 14:06 minhthuc2502

Thanks, I'm still learning to "build" anything (unsuccessfully as of yet...) believe it or not, but if you upload it I'll take a look.

BBC-Esq avatar Jun 19 '24 14:06 BBC-Esq

Can you share the code you used to benchmark?

BBC-Esq avatar Sep 09 '24 09:09 BBC-Esq

I used this. You can tweak it a bit to create the correct prompt for the model you use.
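
For readers who only need the general shape of a tokens-per-second measurement, here is a minimal, generic sketch. It is not the script referred to above and does not call CTranslate2: `generate` is a hypothetical callable that you would wrap around your own model, and `fake_generate` is a stand-in so the example runs on its own. The prompt is passed in by the caller, which is where a model-specific prompt format would go.

```python
import time


def tokens_per_second(generate, prompt_tokens, max_new_tokens=128, warmup=1, runs=3):
    """Time a generation callable and return the generated tokens per second.

    `generate` is a hypothetical callable taking (prompt_tokens, max_new_tokens)
    and returning the list of generated tokens.
    """
    # Warm-up calls are excluded from timing (first calls often pay one-off costs).
    for _ in range(warmup):
        generate(prompt_tokens, max_new_tokens)

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += len(generate(prompt_tokens, max_new_tokens))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate(prompt_tokens, max_new_tokens):
    """Stand-in for a real model: sleeps briefly, then returns dummy tokens."""
    time.sleep(0.01)
    return ["tok"] * max_new_tokens


if __name__ == "__main__":
    rate = tokens_per_second(fake_generate, ["Hello"], max_new_tokens=32)
    print(f"{rate:.1f} tok/s")
```

Counting only generated tokens and using a fixed batch size of 1 keeps the result comparable to the tok/s figures quoted earlier in the thread, assuming those were measured the same way.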

minhthuc2502 avatar Sep 10 '24 10:09 minhthuc2502