# [Feature request] GPT-Q 4-bit support
Thank you for your kind work. FasterTransformer is a remarkable project that benefits many people: it can significantly accelerate many models in LLM.int8() mode, which is truly impressive. However, in recent months 4-bit quantization has become increasingly popular, for example GPT-Q (see https://github.com/PanQiWei/AutoGPTQ).
By quantizing a model to 4-bit, latency can be reduced by almost half. Will FasterTransformer support 4-bit quantization in the future? It would greatly benefit LLM inference speed.
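For reference, a minimal GPT-Q quantization sketch following the AutoGPTQ README; the BLOOM checkpoint name, output directory, and the single calibration sentence are placeholders (real calibration needs a proper dataset):

```python
# Minimal AutoGPTQ sketch (based on the AutoGPTQ README). Checkpoint name,
# output directory, and the single calibration sentence are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "bigscience/bloom-3b"
quantized_model_dir = "bloom-3b-4bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
examples = [tokenizer("FasterTransformer accelerates transformer inference on NVIDIA GPUs.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)                   # run the GPT-Q calibration pass
model.save_quantized(quantized_model_dir)  # write the 4-bit checkpoint
```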
Also, FT has not yet supported int8 quantization for popular LLMs such as BLOOM. Quantizing to 4-bit does not necessarily accelerate a model; it depends on your hardware's compute capabilities, and model accuracy drops at the same time.
FT already supports int8 weight quantization for BLOOM.
Based on my experiments on an A100 80GB, the per-token latency was:
- Torch Bloom-3B: 16.66ms/token
- FT for Bloom-3B with fp16: 5.46ms/token
- FT for Bloom-3B with int8 weight: 4.92ms/token
Hope these numbers help.
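For context, here is a toy PyTorch sketch of what weight-only int8 quantization does conceptually (per-channel scales on the weights, dequantized inside the GEMM). This is only an illustration of the idea, not FT's fused CUDA implementation:

```python
# Toy sketch of weight-only int8: quantize weights per output channel, keep
# activations in floating point, dequantize inside the matmul. FT's real
# int8 weight mode does this with fused CUDA kernels and fp16 activations.
import torch

def quantize_weight_int8(w):
    # Per-output-channel symmetric quantization: one scale per row of W.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    return torch.round(w / scale).clamp(-128, 127).to(torch.int8), scale

def int8_weight_linear(x, w_int8, scale):
    # Dequantize the weights on the fly, then run a normal GEMM.
    return x @ (w_int8.float() * scale).t()

w = torch.randn(4096, 4096)
x = torch.randn(1, 4096)
w_q, s = quantize_weight_int8(w)
print((int8_weight_linear(x, w_q, s) - x @ w.t()).abs().max())  # small quantization error
```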
Do you mean FT officially supports GPTQ-int8 for BLOOM? I did not see it; if you are willing to share the link, I will check it. Actually, for most LLMs, quantization algorithms like the SOTA GPT-Q suffer increasing accuracy loss below 5-bit. Maybe you can also try this one: https://github.com/mit-han-lab/llm-awq. Hope it helps.
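As a rough illustration of that accuracy point, a toy round-to-nearest experiment (not GPT-Q or AWQ themselves, which reduce but do not eliminate this error) shows the weight round-trip error growing quickly below about 5 bits:

```python
# Toy round-to-nearest experiment: round-trip (quantize -> dequantize) error
# of a Gaussian weight matrix grows quickly as the bit width drops.
import torch

torch.manual_seed(0)
w = torch.randn(4096, 4096)

for bits in (8, 5, 4, 3):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax          # per-row symmetric scale
    w_hat = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    rel_err = ((w - w_hat).norm() / w.norm()).item()
    print(f"{bits}-bit RTN relative error: {rel_err:.4f}")
```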
Not GPT-Q int8, just FT's weight-only int8; see the following link.
https://github.com/NVIDIA/FasterTransformer/blob/f8e42aac45815c5be92c0915b12b9a6652386e8c/examples/pytorch/gpt/bloom_lambada.py#L165-L170
Thank you for the llm-awq link, I will check it.
@Xingxiangrui thanks for sharing the numbers above and the code example here. Do you know if FT supports SmoothQuant for BLOOM? I see SmoothQuant is implemented in FT, with int8_mode==2 corresponding to SmoothQuant quantization.
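For anyone unfamiliar, a toy sketch of the SmoothQuant idea referenced above (per-channel smoothing factor s that migrates activation outliers into the weights, assuming the paper's default alpha=0.5). This is only conceptual and not FT's int8_mode=2 kernels:

```python
# SmoothQuant idea: rescale per input channel so activation outliers move into
# the weights, making both tensors easier to quantize to int8. Y = X @ W is
# mathematically unchanged.
import torch

torch.manual_seed(0)
x = torch.randn(16, 1024) * torch.rand(1024) * 10   # activations with outlier channels
w = torch.randn(1024, 1024) * 0.02                  # well-behaved weights

alpha = 0.5
s = x.abs().amax(dim=0).clamp(min=1e-5) ** alpha / w.abs().amax(dim=1).clamp(min=1e-5) ** (1 - alpha)

x_s, w_s = x / s, w * s.unsqueeze(1)                # smoothed activations and weights
print(torch.allclose(x @ w, x_s @ w_s, atol=1e-3))  # output is unchanged
print(x.abs().max().item(), x_s.abs().max().item()) # activation outliers shrink
```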
FasterTransformer development has transitioned to TensorRT-LLM.
GPT-Q is supported in TensorRT-LLM. Please give it a try.