
[Feature] Any plan to implement INT8 weight-only quantization?

Open · yunzhongyan0 opened this issue 2 years ago · 4 comments

Motivation

In most scenarios, weight-only INT8 quantization is the easiest way to get good performance without hurting accuracy.
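For context, "weight-only INT8" here means quantizing just the weight matrices while keeping activations in FP16. A minimal sketch of per-channel, symmetric round-to-nearest (RTN) quantization, which is what the naive weight-only scheme amounts to (illustrative only, not lmdeploy code):

```python
import torch

def quantize_int8_per_channel(w: torch.Tensor):
    """Round-to-nearest, symmetric INT8 quantization of a 2-D weight matrix
    in [out_features, in_features] layout."""
    # One scale per output channel keeps the quantization error local.
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
q, scale = quantize_int8_per_channel(w)
print((w - dequantize(q, scale)).abs().max())  # worst-case per-weight RTN error
```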

Related resources

No response

Additional context

No response

yunzhongyan0 avatar Nov 17 '23 06:11 yunzhongyan0

We have evaluated INT4 weight-only quantization on OpenCompass. According to the results, the accuracy of the INT4-quantized models is on par with FP16.

If you can provide us with reproducible scenarios where INT4 shows significant degradation, we will consider incorporating more quantization algorithms. This will help us better understand the limitations of INT4 and improve our ability to deliver optimal performance across different use cases.

pppppM · Nov 17 '23 07:11

Firstly, plain INT4 weight-only quantization isn't implemented yet; only AWQ is. Secondly, naive INT4 weight-only quantization (without any calibration algorithm) causes accuracy degradation, especially for 7B models.

yunzhongyan0 · Nov 17 '23 07:11

AWQ is an INT4 weight-only quantization algorithm.

The implementation of AWQ in LMDeploy includes numerous engineering optimizations, ensuring superior speed and accuracy compared to the official version.
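For reference, quantizing a model with LMDeploy's AWQ implementation looks roughly like the sketch below. It assumes the `auto_awq` entry point described in the lmdeploy docs of this period; the exact module path, parameter names, and defaults may differ across versions.

```python
# A sketch of 4-bit AWQ quantization through lmdeploy's Python entry point.
# Assumption: `lmdeploy.lite.apis.auto_awq.auto_awq` is available, as in the
# docs of this period; flag/parameter names may vary between releases.
from lmdeploy.lite.apis.auto_awq import auto_awq

auto_awq(
    'meta-llama/Llama-2-7b-hf',   # HF model id or local path (placeholder)
    work_dir='./llama2-7b-4bit',  # where quantized weights are written
    w_bits=4,                     # weight bit-width (AWQ is INT4 weight-only)
    w_group_size=128,             # group size for the quantization scales
)
```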

We have uploaded our quantized models (llama2-7b, baichuan2-7b, and qwen-7b) to the Hugging Face Hub. We observed no decrease in accuracy for any of these models after quantization.

pppppM · Nov 17 '23 08:11

OK, another question: can AWQ quantization calibrate on our own data, rather than an open dataset like 'c4'?
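Conceptually, AWQ's calibration only needs activation statistics from some representative text, so nothing in the algorithm itself is tied to 'c4'. Below is a hypothetical sketch of gathering per-channel activation scales from your own texts; this is not lmdeploy's actual calibration code, and the model name, `texts`, and `hook` helper are placeholders for illustration.

```python
# Hypothetical sketch: collect per-channel activation maxima (the statistics
# an AWQ-style method uses to choose weight-scaling factors) from custom text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

texts = ["your own in-domain text ...", "more in-domain samples ..."]
tok = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')

stats = {}

def hook(name):
    def fn(module, inputs, output):
        x = inputs[0].detach()
        amax = x.abs().amax(dim=(0, 1))  # per input-channel activation max
        stats[name] = torch.maximum(stats.get(name, amax), amax)
    return fn

handles = [m.register_forward_hook(hook(n))
           for n, m in model.named_modules()
           if isinstance(m, torch.nn.Linear)]
with torch.no_grad():
    for t in texts:
        model(**tok(t, return_tensors='pt'))
for h in handles:
    h.remove()
# `stats` now holds the per-channel scales an AWQ-style pass would consume;
# the source of the calibration text is irrelevant to the algorithm.
```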

yunzhongyan0 · Nov 17 '23 09:11