
Support int type zero-points in weight-only GEMM

Open xiaonans opened this issue 1 year ago • 4 comments

Currently, some quantized Hugging Face models store their zero-points directly in the int4 datatype, e.g. Qwen/Qwen2-7B-Instruct-GPTQ-Int4 and Qwen/Qwen2-1.5B-Instruct-AWQ on Hugging Face.

But weight_only_groupwise_quant_matmul in TensorRT-LLM only supports fp16 zero-points as input, which forces a data type conversion such as https://github.com/NVIDIA/TensorRT-LLM/blob/a96cccafcf6365c128f004f779160951f8c0801c/tensorrt_llm/models/qwen/weight.py#L104.

For groupwise quantization, the memory cost of zero-points is not negligible. Would you please add int-type zero-point support to the weight-only GEMM?
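
For illustration, here is a minimal sketch of the kind of unpack-and-cast step this forces (the packing layout and helper name are assumptions for the example, not the actual TensorRT-LLM code):

```python
import torch

def unpack_int4_zeros_to_fp16(packed_zeros: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: unpack int4 zero-points (assumed two per uint8 byte)
    and cast them to fp16 -- the extra conversion this issue asks to avoid."""
    low = (packed_zeros & 0x0F).to(torch.float16)
    high = ((packed_zeros >> 4) & 0x0F).to(torch.float16)
    # Interleave the low/high nibbles back into a flat fp16 zero-point tensor.
    return torch.stack((low, high), dim=-1).flatten(start_dim=-2)
```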

xiaonans avatar Jul 09 '24 10:07 xiaonans

@Tracin Could you please have a look? Thanks

QiJune avatar Jul 09 '24 12:07 QiJune

@xiaonans For a group_size of, say, 128, every 128 4-bit weights share one fp16 zero_point. The memory ratio of zero-points to weights is 16 / (128 * 4), about 3% if my calculation is correct, which I think is negligible. On the other hand, dequantizing the zero-points in the kernel will add overhead. Do you agree?

Tracin avatar Jul 10 '24 09:07 Tracin

> @xiaonans For a group_size of, say, 128, every 128 4-bit weights share one fp16 zero_point. The memory ratio of zero-points to weights is 16 / (128 * 4), about 3% if my calculation is correct, which I think is negligible.

If group_size=64 and 2-bit weights are used, the zero-point/weight memory ratio is 16 / (64 * 2), about 12.5%.
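
A quick back-of-the-envelope check of both ratios (illustrative arithmetic only, not TensorRT-LLM code):

```python
def zp_overhead(group_size: int, weight_bits: int, zp_bits: int = 16) -> float:
    """Zero-point / weight memory ratio: one zero-point per group of weights."""
    return zp_bits / (group_size * weight_bits)

print(zp_overhead(128, 4))            # 0.03125 -> ~3% for fp16 zp, 4-bit weights
print(zp_overhead(64, 2))             # 0.125   -> 12.5% for fp16 zp, 2-bit weights
print(zp_overhead(64, 2, zp_bits=4))  # 0.03125 -> ~3% if the zp stays int4
```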

> On the other hand, dequantizing the zero-points in the kernel will add overhead. Do you agree?

If the fpA_intB_gemm kernel could load int4 zero-points directly from global memory, the loading overhead would be lower than with fp16 zero-points. In memory-bound scenarios this should bring a speedup.
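
To put a rough number on the loading argument (again just illustrative arithmetic, assuming group_size=128, 4-bit weights, and one fp16 scale per group):

```python
def bytes_per_group(group_size: int = 128, weight_bits: int = 4,
                    scale_bits: int = 16, zp_bits: int = 16) -> float:
    """Bytes read from global memory per quantization group (weights + scale + zero-point)."""
    return (group_size * weight_bits + scale_bits + zp_bits) / 8

fp16_zp = bytes_per_group(zp_bits=16)   # 68.0 bytes per group
int4_zp = bytes_per_group(zp_bits=4)    # 66.5 bytes per group
print(1 - int4_zp / fp16_zp)            # ~2.2% less global-memory traffic per group
```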

xiaonans avatar Jul 11 '24 09:07 xiaonans

@xiaonans How do you deploy this model (Qwen/Qwen2-1.5B-Instruct-AWQ on Hugging Face) with TensorRT-LLM? TensorRT-LLM only supports fp16 transformer models. Thank you!

shaoyanguo avatar Aug 09 '24 11:08 shaoyanguo

@xiaonans If you have no further questions, we will close this issue in one week.

hello-11 avatar Nov 14 '24 06:11 hello-11