llama.cpp
[Feature request] AWQ (activation-aware weight quantization) 4-bit quantization
There is a new 4-bit quantization method which looks interesting.
Paper is "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration " https://arxiv.org/abs/2306.00978 It contains both 3- and 4-bit, although I think the 4-bit variant is more interesting.
Github Repo with CUDA kernels is at https://github.com/mit-han-lab/llm-awq
Looks quite interesting!
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [Paper]
Efficient and accurate low-bit weight quantization (INT3/4) for LLMs, supporting instruction-tuned models and multi-modal LMs.
The current release supports:
- AWQ search for accurate quantization.
- Pre-computed AWQ model zoo for LLMs (LLaMA, OPT, Vicuna, LLaVA; load to generate quantized weights).
- Memory-efficient 4-bit Linear in PyTorch.
- Efficient CUDA kernel implementation for fast inference (supports the context and decoding stages).
- Examples on 4-bit inference of an instruction-tuned model (Vicuna) and multi-modal LM (LLaVA).
That would require retraining the model, or one could possibly convert an existing one. I didn't quite understand that part of the paper.
No, the paper doesn't mention retraining, as far as I could understand it. I also cannot find any training loop in the code. The main function for generating a quantized model seems to be `run_awq` in https://github.com/mit-han-lab/llm-awq/blob/main/awq/quantize/pre_quant.py. It does some optimization internally (to find the optimal scale factors), but that optimization appears to be a brute-force search over candidate scales rather than anything gradient-based.
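To make that concrete, here is a rough Python sketch of what I think such a gradient-free search looks like for a single linear layer (my own reading of the approach, not the actual `run_awq` code; the function names, grid size and group size are my assumptions): try a small grid of exponents, build per-channel scales from calibration activation magnitudes, and keep whichever scaling gives the lowest output error after 4-bit round-to-nearest quantization.

```python
# Hypothetical sketch of a gradient-free AWQ-style scale search (not the llm-awq code).
import torch

def quantize_rtn(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Plain round-to-nearest fake quantization with per-group asymmetric scales."""
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    g_max, g_min = g.amax(dim=-1, keepdim=True), g.amin(dim=-1, keepdim=True)
    scale = (g_max - g_min).clamp(min=1e-5) / (2**n_bits - 1)
    zero = (-g_min / scale).round()
    q = (g / scale).round().add(zero).clamp(0, 2**n_bits - 1)
    return ((q - zero) * scale).reshape(out_features, in_features)

def search_awq_scale(w: torch.Tensor, x_calib: torch.Tensor, n_grid: int = 20):
    """Brute-force search over alpha; returns the best per-input-channel scale."""
    act_mag = x_calib.abs().mean(dim=0)        # per-input-channel activation magnitude
    ref_out = x_calib @ w.t()                  # fp16 reference output
    best_err, best_scale = float("inf"), None
    for i in range(n_grid):
        alpha = i / n_grid                     # alpha = 0 is the "no scaling" baseline
        s = act_mag.pow(alpha).clamp(min=1e-4)
        s = s / (s.max() * s.min()).sqrt()     # keep the scales balanced around 1
        w_q = quantize_rtn(w * s) / s          # scale, quantize, undo the scale
        err = (x_calib @ w_q.t() - ref_out).pow(2).mean().item()
        if err < best_err:
            best_err, best_scale = err, s
    return best_scale
```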
Another new quantization technique introduced this week is SpQR by Tim Dettmers and his team. Both methods show potential over GPTQ. It would be interesting to see them compared directly.
Paper: https://arxiv.org/pdf/2306.03078.pdf GitHub: https://github.com/Vahe1994/SpQR Twitter Thread: https://twitter.com/Tim_Dettmers/status/1666076553665744896
> That would require retraining the model, or one could possibly convert an existing one. I didn't quite understand that part of the paper.
AWQ doesn't require retraining; it is merely a smarter way of quantizing the existing fp16 weights. Basically, the paper claims that weights which cause a larger activation magnitude are more important to the quality of the model and should therefore be kept as accurate as possible, while weights that don't affect the activations much can be stored less accurately without issue.
There's some math in there which I don't completely understand that optimizes the scaling of the quants in order to protect the weights with larger activations. The formula requires what the paper calls "the input features cached from a small calibration set (we take a small calibration set from the pre-training dataset in order not to overfit to a specific task)". I'm not entirely sure what that dataset entails, but it does not look like retraining.
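For intuition, here is a tiny numerical toy (my own illustration, not code from the paper or its repo) of why the scaling helps: multiply the weights feeding a salient input channel by `s` before round-to-nearest quantization and divide that channel's activation by `s` afterwards. The fp16 product is mathematically unchanged, but the rounding error seen at the output shrinks by roughly `1/s`, assuming the group's quantization step stays about the same, which is the paper's argument.

```python
# Toy illustration of AWQ-style channel scaling (assumed values, not real model weights).
import torch

torch.manual_seed(0)
w = torch.randn(4096) * 0.02   # column of weights multiplying one salient input channel
x = 3.0                        # that channel's (large) activation value
step = 0.01                    # quantization step of the weight group

def fake_quant(t, step):
    return (t / step).round() * step     # round-to-nearest onto the grid

# Plain quantization: output error is x * (rounding error of w).
err_plain = (x * (fake_quant(w, step) - w)).abs().mean().item()

# AWQ-style: scale the weights up by s and fold 1/s into the activation.
s = 2.0
err_scaled = ((x / s) * (fake_quant(w * s, step) - w * s)).abs().mean().item()

print(f"mean output error, plain : {err_plain:.6f}")
print(f"mean output error, scaled: {err_scaled:.6f}")   # roughly err_plain / s
```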
@WesCook Thanks, SpQR looks also interesting. Although AWQ seems to be the 'easier' format (to understand and implement). Just from a first look at both papers.
Approximate performance is:
| LLaMA-7B | Bits per weight | WikiText-2 perplexity |
|---|---|---|
| FP16 | 16 | 5.68 |
| RTN | 4.00 | 6.29 |
| RTN | 3.00 | 25.54 |
| GPTQ-4b-128g | 4.15 | 5.85 |
| GPTQ-3b-128g | 3.15 | 6.61 |
| AWQ-4b-128g | 4.15 | 5.81 |
| AWQ-3b-128g | 3.15 | 6.46 |
| AWQ-3b-32g | 3.60 | 6.10 |
| SpQR-3b-16g-3b-32g-0.4% | 3.63 | 5.73 |
SpQR's performance is much better. However, SpQR is complex to implement. AWQ, on the other hand, can be saved in the same format as GPTQ, so you can make it compatible with GGML with minor changes.
https://github.com/qwopqwop200/llm-awq This is the AWQ code that has been changed to save in a format similar to GPTQ. The differences are:
- Some variables have been renamed.
- `zeros -= 1` is not used.
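For reference, here is a rough sketch of how 4-bit weights plus per-group scales and zero points can be packed GPTQ-style into 32-bit words. This is based on my assumptions about the layout, not the fork's actual code; the point is simply where GPTQ's `zeros -= 1` shift would normally happen and is skipped here, so a loader has to know which convention a given file uses.

```python
# Hypothetical GPTQ-like packing of 4-bit weights (layout details are assumptions).
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack a (rows, cols) array of 4-bit values into (rows // 8, cols) uint32 words."""
    q = q.astype(np.uint32)
    packed = np.zeros((q.shape[0] // 8, q.shape[1]), dtype=np.uint32)
    for i in range(8):                                   # 8 nibbles per 32-bit word
        packed |= q[i::8] << np.uint32(4 * i)
    return packed

def quantize_groupwise(w: np.ndarray, group_size: int = 128):
    """Asymmetric 4-bit group quantization; returns packed weights, scales and zero points."""
    out_f, in_f = w.shape                                # assumes both divisible by 8 / group_size
    g = w.reshape(out_f, in_f // group_size, group_size)
    g_min, g_max = g.min(-1, keepdims=True), g.max(-1, keepdims=True)
    scales = np.maximum(g_max - g_min, 1e-5) / 15        # 4 bits -> 16 levels
    zeros = np.clip(np.round(-g_min / scales), 0, 15)
    q = np.clip(np.round(g / scales) + zeros, 0, 15).reshape(out_f, in_f)
    qweight = pack_int4(q.T)                             # pack along the input dimension
    # GPTQ would apply `zeros -= 1` before packing; as noted above, the fork skips
    # that shift, so the stored zero points here are the unshifted values.
    qzeros = pack_int4(zeros.squeeze(-1))
    return qweight, scales.squeeze(-1).astype(np.float16), qzeros
```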
Will you PR it to AutoGPTQ, qwopqwop?
Currently I have no plans to open a PR on AutoGPTQ.
Hi everyone, I have tried to put together a PR to add AWQ. I would really appreciate comments to make it better, thanks! The PR: Add AWQ
This can be closed.
This issue was closed because it has been inactive for 14 days since being marked as stale.