
# [Feature request] GPT-Q 4-bit support

Open Xingxiangrui opened this issue 1 year ago • 6 comments

Thank you for your kind work. FasterTransformer is indeed a remarkable achievement that benefits many people. It can significantly accelerate many models in LLM.int8() mode, which is truly incredible. However, in recent months 4-bit quantization has become increasingly popular, for example GPT-Q (https://github.com/PanQiWei/AutoGPTQ).

By quantizing the model to 4 bits, latency can be reduced by almost half. Therefore, will FasterTransformer support 4-bit quantization in the future? It would greatly benefit the speed of LLMs.
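For reference, a minimal NumPy sketch of plain group-wise 4-bit weight quantization. This is not the full GPT-Q algorithm (which additionally compensates quantization error column by column using second-order statistics from calibration data); the group size and shapes are illustrative only.

```python
# Sketch: group-wise symmetric 4-bit weight quantization (not GPT-Q itself).
import numpy as np

def quantize_4bit_groupwise(w: np.ndarray, group_size: int = 128):
    """Quantize a [out_features, in_features] weight to 4-bit codes.

    Each group of `group_size` consecutive input channels shares one scale.
    A real kernel packs two 4-bit codes per byte; here they are kept as int8
    for readability.
    """
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    w_groups = w.reshape(out_f, in_f // group_size, group_size)
    # Symmetric mapping of [-max, max] onto the 4-bit range [-8, 7].
    scales = np.maximum(np.abs(w_groups).max(axis=-1, keepdims=True), 1e-8) / 7.0
    q = np.clip(np.round(w_groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight for use in a normal matmul."""
    return (q.astype(np.float32) * scales).reshape(q.shape[0], -1)

if __name__ == "__main__":
    w = np.random.randn(4096, 4096).astype(np.float32)
    q, s = quantize_4bit_groupwise(w)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"mean abs quantization error: {err:.4f}")
    # 4-bit codes take ~2x less memory than int8 and ~4x less than fp16,
    # which is where the latency win comes from in memory-bound decoding.
```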

Xingxiangrui avatar Jul 11 '23 11:07 Xingxiangrui

Even though FT has not yet supported int8 quantization for popular LLMs like BLOOM, quantizing to 4-bit does not necessarily accelerate a model; it depends on your hardware's compute capability, and model accuracy will drop at the same time.

77h2l avatar Jul 13 '23 03:07 77h2l

> Even though FT has not yet supported int8 quantization for popular LLMs like BLOOM, quantizing to 4-bit does not necessarily accelerate a model; it depends on your hardware's compute capability, and model accuracy will drop at the same time.

FT already supports int8 weight quantization for BLOOM.

Based on my experiments on an A100 80GB, each token took the following time:

  • Torch Bloom-3B: 16.66ms/token
  • FT for Bloom-3B with fp16: 5.46ms/token
  • FT for Bloom-3B with int8 weight: 4.92ms/token

Hope these numbers help you.
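For context, a minimal sketch of how such ms/token numbers are typically measured: generate a fixed number of tokens after a warm-up run and divide wall time by the token count. `generate_fn`, `prompt_ids`, and the token counts are placeholders, not part of the FT example scripts.

```python
# Sketch: per-token decoding latency, assuming generate_fn blocks until all
# tokens are produced (otherwise a device synchronization would be needed).
import time

def ms_per_token(generate_fn, prompt_ids, new_tokens: int = 256, warmup: int = 1) -> float:
    for _ in range(warmup):
        generate_fn(prompt_ids, new_tokens)   # warm-up: contexts, kernels, caches
    start = time.perf_counter()
    generate_fn(prompt_ids, new_tokens)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / new_tokens
```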

Xingxiangrui avatar Jul 14 '23 05:07 Xingxiangrui

> > Even though FT has not yet supported int8 quantization for popular LLMs like BLOOM, quantizing to 4-bit does not necessarily accelerate a model; it depends on your hardware's compute capability, and model accuracy will drop at the same time.
>
> FT already supports int8 weight quantization for BLOOM.
>
> Based on my experiments on an A100 80GB, each token took the following time:
>
> • Torch Bloom-3B: 16.66ms/token
> • FT for Bloom-3B with fp16: 5.46ms/token
> • FT for Bloom-3B with int8 weight: 4.92ms/token
>
> Hope these numbers help you.

You mean FT officially supports GPTQ-int8 for BLOOM? I did not see it; please post the link if you're willing and I will check it. Actually, for most LLM quantization algorithms, even the SOTA GPT-Q, going below 5-bit quantization increasingly hurts accuracy. Maybe you can try this one: https://github.com/mit-han-lab/llm-awq. Hope it helps.

77h2l avatar Jul 17 '23 01:07 77h2l

Not GPT-Q int8, just FT weight int8 (weight-only quantization); see the following link:

https://github.com/NVIDIA/FasterTransformer/blob/f8e42aac45815c5be92c0915b12b9a6652386e8c/examples/pytorch/gpt/bloom_lambada.py#L165-L170

Thank you for the llm-awq link; I will check it.
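To illustrate the distinction being drawn here, a minimal NumPy sketch of what weight-only int8 (FT's int8_mode=1) amounts to conceptually: int8 weights with one scale per output channel, dequantized before the usual fp16 GEMM, no calibration data required. This is not FT's actual kernel; the shapes and names are illustrative.

```python
# Sketch: per-output-channel weight-only int8, as opposed to GPT-Q.
import numpy as np

def quantize_weight_int8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization of a [out, in] weight."""
    scales = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def linear_int8_weight(x: np.ndarray, q: np.ndarray, scales: np.ndarray):
    """y = x @ W^T, dequantizing the int8 weight just before the matmul."""
    w = q.astype(np.float32) * scales
    return x @ w.T

if __name__ == "__main__":
    w = np.random.randn(1024, 1024).astype(np.float32)
    x = np.random.randn(4, 1024).astype(np.float32)
    q, s = quantize_weight_int8(w)
    ref = x @ w.T
    out = linear_int8_weight(x, q, s)
    print("max abs error:", np.abs(ref - out).max())
```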

Xingxiangrui avatar Jul 19 '23 01:07 Xingxiangrui

> Not GPT-Q int8, just FT weight int8 (weight-only quantization); see the following link:
>
> https://github.com/NVIDIA/FasterTransformer/blob/f8e42aac45815c5be92c0915b12b9a6652386e8c/examples/pytorch/gpt/bloom_lambada.py#L165-L170
>
> Thank you for the llm-awq link; I will check it.

@Xingxiangrui thanks for sharing the numbers above and the code example here. Do you know if FT supports SmoothQuant for BLOOM? I see SmoothQuant is implemented in FT, and int8_mode==2 corresponds to SmoothQuant quantization.
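For reference, a rough NumPy sketch of the SmoothQuant smoothing step that int8_mode==2 is based on: per-input-channel scales migrate activation outliers into the weights so that both activations and weights quantize well to int8. The formula and `alpha` follow the SmoothQuant paper, not FT's implementation; all names are illustrative.

```python
# Sketch: SmoothQuant-style smoothing of activations/weights before int8 quantization.
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """x: [tokens, in], w: [out, in]. Returns (x_hat, w_hat) such that
    x_hat @ w_hat.T == x @ w.T up to float error, with x_hat's outliers reduced."""
    act_max = np.abs(x).max(axis=0)            # per input channel
    wgt_max = np.abs(w).max(axis=0)            # per input channel
    s = np.maximum((act_max ** alpha) / (wgt_max ** (1 - alpha)), 1e-5)
    return x / s, w * s

if __name__ == "__main__":
    x = np.random.randn(16, 512).astype(np.float32)
    x[:, 0] *= 50.0                            # simulate an activation outlier channel
    w = np.random.randn(512, 512).astype(np.float32)
    x_hat, w_hat = smooth(x, w)
    print("max abs output diff:", np.abs(x @ w.T - x_hat @ w_hat.T).max())
    print("activation range before/after:", np.abs(x).max(), np.abs(x_hat).max())
```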

EarthXP avatar Jul 26 '23 06:07 EarthXP

FasterTransformer development has transitioned to TensorRT-LLM.

GPT-Q is supported in TensorRT-LLM. Please give it a try.

byshiue avatar Oct 20 '23 07:10 byshiue