[Feature] Any plan to implement INT8 weight-only quantization?
Motivation
In most scenarios, weight-only INT8 quantization is the easiest way to reach good performance without affecting accuracy.
Related resources
No response
Additional context
No response
We have tested the performance of INT4 weight-only quantization on OpenCompass. According to the results, the performance of INT4 quantization is on par with FP16.
If you can provide us with reproducible scenarios where INT4 shows significant performance degradation, we will consider expanding our support and incorporating more quantization algorithms. This will help us better understand the limitations of INT4 and enhance our capacity to deliver optimal performance across different use cases.
Firstly, plain INT4 weight-only quantization isn't implemented yet; only AWQ is implemented. Secondly, naive INT4 weight-only quantization (without any quantization algorithm) will cause performance degradation, especially for 7B models.
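For context, "weight-only without any quantization algorithm" here means naive round-to-nearest (RTN) quantization of the weights. Below is a minimal, illustrative PyTorch sketch of symmetric group-wise RTN (not LMDeploy's implementation; the function name and group size are arbitrary choices):

```python
import torch

def rtn_quantize_weight(w: torch.Tensor, n_bits: int = 4, group_size: int = 128):
    """Naive round-to-nearest (RTN) weight-only quantization.

    Quantizes each group of `group_size` values along the input dimension
    with a per-group scale, then dequantizes back to float for comparison.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w_g = w.reshape(out_features, in_features // group_size, group_size)

    q_max = 2 ** (n_bits - 1) - 1                      # e.g. 7 for symmetric INT4
    scale = w_g.abs().amax(dim=-1, keepdim=True) / q_max
    q = torch.clamp(torch.round(w_g / scale), -q_max - 1, q_max)
    return (q * scale).reshape(out_features, in_features)

# Example: measure the reconstruction error RTN introduces on a random weight.
w = torch.randn(4096, 4096)
err = (rtn_quantize_weight(w, n_bits=4) - w).abs().mean()
print(f"mean abs error: {err:.5f}")
```

Algorithms such as AWQ reduce this reconstruction error by rescaling salient channels before rounding, which is why they hold up better than plain RTN on smaller models.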
AWQ is an INT4 weight-only quantization algorithm.
The implementation of AWQ in LMDeploy includes numerous engineering optimizations, ensuring superior speed and accuracy compared to the official version.
We have uploaded our quantized models: llama2-7b, baichuan2-7b, and qwen-7b onto the HuggingFace Hub. There was no observed decrease in accuracy for any of these models following quantization.
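As a usage sketch, one of those quantized checkpoints could be served with LMDeploy's pipeline API along the lines below. This assumes a recent LMDeploy release where the pipeline API is available; the repo id is a placeholder, so substitute the actual model card name from the HuggingFace Hub.

```python
# Minimal sketch: serve a 4-bit AWQ checkpoint with LMDeploy's pipeline API.
# The repo id below is a placeholder -- replace it with the actual quantized
# model uploaded to the HuggingFace Hub.
from lmdeploy import pipeline, TurbomindEngineConfig

pipe = pipeline(
    "lmdeploy/llama2-chat-7b-w4",                     # placeholder repo id
    backend_config=TurbomindEngineConfig(model_format="awq"),
)
print(pipe(["Explain weight-only INT4 quantization in one sentence."]))
```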
OK. Another question: can AWQ quantization provide a way to calibrate on our own data instead of open datasets like 'c4'?
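For illustration only: the general idea behind calibrating on your own data is to replace the 'c4' samples with tokenized chunks drawn from your own corpus. The sketch below shows how such a calibration set could be assembled; the tokenizer name, file paths, and `quantize_awq` entrypoint are hypothetical placeholders, not an existing LMDeploy interface.

```python
# Illustrative sketch: build an AWQ calibration set from your own text files
# instead of the 'c4' dataset. `quantize_awq` is hypothetical -- whatever
# quantization entrypoint you use would consume `calib_samples` in its place.
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")  # placeholder

calib_samples = []
for path in Path("my_corpus/").glob("*.txt"):                     # your own documents
    ids = tokenizer(path.read_text(encoding="utf-8"), return_tensors="pt").input_ids
    # Split into fixed-length chunks, as calibration typically expects.
    for start in range(0, ids.shape[1] - 2048, 2048):
        calib_samples.append(ids[:, start:start + 2048])

# quantize_awq(model, calib_samples)   # hypothetical call, shown for shape only
print(f"collected {len(calib_samples)} calibration samples")
```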