# [Feature request] GPT-Q 4-bit support
Thank you for your kind work. FasterTransformer is a remarkable project that benefits many people: it can significantly accelerate many models in LLM.int8() mode, which is truly impressive. However, in recent months 4-bit quantization has become increasingly popular, for example GPT-Q (see https://github.com/PanQiWei/AutoGPTQ).
By quantizing a model to 4-bit, latency can be reduced by almost half. Will FasterTransformer support 4-bit quantization in the future? It would greatly benefit LLM inference speed.
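For reference, a minimal GPT-Q quantization sketch following the AutoGPTQ README; the BLOOM checkpoint name, output directory, and the single calibration sentence are placeholders (real calibration needs a proper dataset):

```python
# Minimal AutoGPTQ sketch (based on the AutoGPTQ README). Checkpoint name,
# output directory, and the single calibration sentence are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "bigscience/bloom-3b"
quantized_model_dir = "bloom-3b-4bit-gptq"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir)
examples = [tokenizer("FasterTransformer accelerates transformer inference on NVIDIA GPUs.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)                   # run the GPT-Q calibration pass
model.save_quantized(quantized_model_dir)  # write the 4-bit checkpoint
```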
Also, FT has not yet supported int8 quantization for popular LLMs such as BLOOM. Quantizing to 4-bit does not necessarily accelerate a model; it depends on your hardware's compute capabilities, and model accuracy drops at the same time.
FT already supports int8 weight quantization for BLOOM.
Based on my experiments on an A100 80GB, the per-token latency was:
- Torch Bloom-3B: 16.66ms/token
- FT for Bloom-3B with fp16: 5.46ms/token
- FT for Bloom-3B with int8 weight: 4.92ms/token
Hope these numbers help.
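For context, here is a toy PyTorch sketch of what weight-only int8 quantization does conceptually (per-channel scales on the weights, dequantized inside the GEMM). This is only an illustration of the idea, not FT's fused CUDA implementation:

```python
# Toy sketch of weight-only int8: quantize weights per output channel, keep
# activations in floating point, dequantize inside the matmul. FT's real
# int8 weight mode does this with fused CUDA kernels and fp16 activations.
import torch

def quantize_weight_int8(w):
    # Per-output-channel symmetric quantization: one scale per row of W.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    return torch.round(w / scale).clamp(-128, 127).to(torch.int8), scale

def int8_weight_linear(x, w_int8, scale):
    # Dequantize the weights on the fly, then run a normal GEMM.
    return x @ (w_int8.float() * scale).t()

w = torch.randn(4096, 4096)
x = torch.randn(1, 4096)
w_q, s = quantize_weight_int8(w)
print((int8_weight_linear(x, w_q, s) - x @ w.t()).abs().max())  # small quantization error
```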
Do you mean FT officially supports GPTQ-int8 for BLOOM? I did not see it; if you are willing to share the link, I will check it. Actually, for most LLMs, quantization algorithms like the SOTA GPT-Q suffer increasing accuracy loss below 5-bit. Maybe you can also try this one: https://github.com/mit-han-lab/llm-awq. Hope it helps.
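As a rough illustration of that accuracy point, a toy round-to-nearest experiment (not GPT-Q or AWQ themselves, which reduce but do not eliminate this error) shows the weight round-trip error growing quickly below about 5 bits:

```python
# Toy round-to-nearest experiment: round-trip (quantize -> dequantize) error
# of a Gaussian weight matrix grows quickly as the bit width drops.
import torch

torch.manual_seed(0)
w = torch.randn(4096, 4096)

for bits in (8, 5, 4, 3):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax          # per-row symmetric scale
    w_hat = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    rel_err = ((w - w_hat).norm() / w.norm()).item()
    print(f"{bits}-bit RTN relative error: {rel_err:.4f}")
```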
Not GPT-Q int8, just FT's weight-only int8; see the following link.
https://github.com/NVIDIA/FasterTransformer/blob/f8e42aac45815c5be92c0915b12b9a6652386e8c/examples/pytorch/gpt/bloom_lambada.py#L165-L170
Thank you for the llm-awq link, I will check it.
@Xingxiangrui thanks for sharing the numbers above and the code example here. Do you know if FT supports SmoothQuant for BLOOM? I see SmoothQuant is implemented in FT, with int8_mode==2 corresponding to SmoothQuant quantization.
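For anyone unfamiliar, a toy sketch of the SmoothQuant idea referenced above (per-channel smoothing factor s that migrates activation outliers into the weights, assuming the paper's default alpha=0.5). This is only conceptual and not FT's int8_mode=2 kernels:

```python
# SmoothQuant idea: rescale per input channel so activation outliers move into
# the weights, making both tensors easier to quantize to int8. Y = X @ W is
# mathematically unchanged.
import torch

torch.manual_seed(0)
x = torch.randn(16, 1024) * torch.rand(1024) * 10   # activations with outlier channels
w = torch.randn(1024, 1024) * 0.02                  # well-behaved weights

alpha = 0.5
s = x.abs().amax(dim=0).clamp(min=1e-5) ** alpha / w.abs().amax(dim=1).clamp(min=1e-5) ** (1 - alpha)

x_s, w_s = x / s, w * s.unsqueeze(1)                # smoothed activations and weights
print(torch.allclose(x @ w, x_s @ w_s, atol=1e-3))  # output is unchanged
print(x.abs().max().item(), x_s.abs().max().item()) # activation outliers shrink
```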
FasterTransformer development has transitioned to TensorRT-LLM.
GPT-Q is supported in TensorRT-LLM. Please give it a try.