
[Feature request] AWQ (activation-aware weight quantization) 4-bit quantization

Open hfassold opened this issue 2 years ago • 8 comments

There is a new 4-bit quantization method which looks interesting.

The paper is "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration": https://arxiv.org/abs/2306.00978. It covers both 3- and 4-bit quantization, although I think the 4-bit variant is more interesting.

Github Repo with CUDA kernels is at https://github.com/mit-han-lab/llm-awq

hfassold avatar Jun 06 '23 08:06 hfassold

Looks quite interesting!

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration [Paper]

Efficient and accurate low-bit weight quantization (INT3/4) for LLMs, supporting instruction-tuned models and multi-modal LMs.

[overview figure]

The current release supports:

  • AWQ search for accurate quantization.
  • Pre-computed AWQ model zoo for LLMs (LLaMA, OPT, Vicuna, LLaVA; load to generate quantized weights).
  • Memory-efficient 4-bit Linear in PyTorch.
  • Efficient CUDA kernel implementation for fast inference (support context and decoding stage).
  • Examples on 4-bit inference of an instruction-tuned model (Vicuna) and multi-modal LM (LLaVA).

EwoutH avatar Jun 06 '23 11:06 EwoutH

That would require retraining the model, or one could possibly convert an existing one. I didn't quite understand that part of the paper.

FSSRepo avatar Jun 06 '23 16:06 FSSRepo

No, the paper doesn't mention retraining, as far as I could understand it. Also, in the code I cannot find any training loop. The main function for generating the quantized model seems to be 'run_awq' in https://github.com/mit-han-lab/llm-awq/blob/main/awq/quantize/pre_quant.py. It does some optimization internally (to find the optimal scale factors), but that optimization appears to be a sort of brute-force search over candidate scales rather than gradient-based training. No gradients involved.
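For illustration, here is a rough sketch of what such a gradient-free scale search could look like. This is my own reconstruction of the idea, not code from the repo; `pseudo_quantize`, `search_scales`, and the calibration tensor `calib_acts` are invented names for this example.

```python
import torch

def pseudo_quantize(w, n_bits=4):
    # Symmetric round-to-nearest quantization, one scale per output row.
    max_val = w.abs().amax(dim=1, keepdim=True)
    scale = max_val / (2 ** (n_bits - 1) - 1)
    q = (w / scale).round().clamp(-(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale

def search_scales(w, calib_acts, n_grid=20):
    # w: [out_features, in_features]; calib_acts: [tokens, in_features]
    act_mean = calib_acts.abs().mean(dim=0)          # per-input-channel activation magnitude
    best_err, best_scales = float("inf"), None
    for i in range(n_grid):                          # plain grid search, no gradients
        alpha = i / n_grid
        scales = act_mean.clamp(min=1e-4) ** alpha
        scales = scales / (scales.max() * scales.min()).sqrt()
        w_q = pseudo_quantize(w * scales) / scales   # quantize scaled weights, undo the scale
        err = ((w_q - w) @ calib_acts.T).pow(2).mean().item()  # output error on the calibration set
        if err < best_err:
            best_err, best_scales = err, scales.clone()
    return best_scales
```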

hfassold avatar Jun 06 '23 16:06 hfassold

Another new quantization technique introduced this week is SpQR, by Tim Dettmers and his team. Both methods show potential gains over GPTQ, so it would be interesting to see them compared directly.

Paper: https://arxiv.org/pdf/2306.03078.pdf
GitHub: https://github.com/Vahe1994/SpQR
Twitter thread: https://twitter.com/Tim_Dettmers/status/1666076553665744896

WesCook avatar Jun 06 '23 19:06 WesCook

That would require retraining the model, or one could possibly convert an existing one. I didn't quite understand that part of the paper.

AWQ doesn't require retraining; it is merely a smarter way of quantizing the existing fp16 weights. Basically, the paper claims that weights which cause a larger activation magnitude are more important to the quality of the model and thus should be kept as accurate as possible. Meanwhile, weights which don't affect the activations much can be stored less accurately without issue.

There's some math in there, which I don't completely understand, that optimizes the scaling of the quants in order to protect the weights with larger activations. The formula requires what the paper calls "the input features cached from a small calibration set (we take a small calibration set from the pre-training dataset in order not to overfit to a specific task)". I'm not entirely sure what that dataset entails, but it does not look like retraining.
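To make the idea concrete, here is a tiny self-contained illustration of the scaling trick as I understand it (not the authors' code; all names and the toy sizes are invented for this example). Scaling salient input channels of the weight up and folding the inverse scale into the activations leaves the full-precision result unchanged, but changes which weights lose precision when rounded to 4 bits:

```python
import torch

out_f, in_f, tokens = 8, 16, 32
w = torch.randn(out_f, in_f)
x = torch.randn(tokens, in_f)       # stands in for the cached calibration activations

# Per-input-channel scale: channels with larger average activation get a larger scale.
s = x.abs().mean(dim=0).clamp(min=1e-4).sqrt()

def rtn4(t):
    # Plain round-to-nearest 4-bit quantization, one scale per output row.
    m = t.abs().amax(dim=1, keepdim=True)
    q = (t / m * 7).round().clamp(-8, 7)
    return q / 7 * m

y_ref   = x @ w.T                   # full-precision reference output
y_plain = x @ rtn4(w).T             # quantize the weights directly
y_awq   = (x / s) @ rtn4(w * s).T   # scale weights up, fold the inverse into the activations

print((y_plain - y_ref).pow(2).mean().item(),
      (y_awq - y_ref).pow(2).mean().item())
```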

ghost avatar Jun 07 '23 02:06 ghost

@WesCook Thanks, SpQR also looks interesting, although AWQ seems to be the 'easier' format to understand and implement, just from a first look at both papers.

hfassold avatar Jun 07 '23 09:06 hfassold

Approximate performance is:

| LLaMA-7B | BPW | Wikitext2 (ppl) |
| --- | --- | --- |
| FP16 | 16 | 5.68 |
| RTN | 4.00 | 6.29 |
| RTN | 3.00 | 25.54 |
| GPTQ-4b-128g | 4.15 | 5.85 |
| GPTQ-3b-128g | 3.15 | 6.61 |
| AWQ-4b-128g | 4.15 | 5.81 |
| AWQ-3b-128g | 3.15 | 6.46 |
| AWQ-3b-32g | 3.60 | 6.10 |
| SpQR-3b-16g-3b-32g-0.4% | 3.63 | 5.73 |

SpQR's performance is much better. However, SpQR is complex to implement. AWQ, on the other hand, can be saved in the same format as GPTQ, so you can make it compatible with GGML with minor changes.

qwopqwop200 avatar Jun 07 '23 10:06 qwopqwop200

https://github.com/qwopqwop200/llm-awq is the AWQ code, modified to save in a format similar to GPTQ. The differences are:

  1. Some variables have been renamed.
  2. zeros -= 1 is not used (see the sketch after this list).
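To clarify what the zeros -= 1 difference means in practice, here is a hedged sketch of asymmetric dequantization under the two conventions. This is my interpretation, not code from either repository, and the function names are invented; both layouts reconstruct w = (q - z) * scale and only disagree on whether the stored zero point has 1 subtracted from it on disk.

```python
def dequant_gptq_style(q, stored_zeros, scale):
    # GPTQ-for-LLaMa convention (assumption): the file stores (zeros - 1),
    # so the offset is added back at dequantization time.
    return (q - (stored_zeros + 1)) * scale

def dequant_awq_style(q, stored_zeros, scale):
    # The modified AWQ export above: zero points stored as-is.
    return (q - stored_zeros) * scale
```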

qwopqwop200 avatar Jun 07 '23 13:06 qwopqwop200

Will you PR it to AutoGPTQ, qwopqwop?

TheBloke avatar Jun 09 '23 08:06 TheBloke

Currently I have no plans to open a PR to AutoGPTQ.

qwopqwop200 avatar Jun 09 '23 17:06 qwopqwop200

Hi everyone, I have tried to make a PR to add AWQ. I would really appreciate comments to make it better, thanks! The PR: Add AWQ

namtranase avatar Dec 22 '23 09:12 namtranase

This can be closed.

kalomaze avatar Dec 31 '23 19:12 kalomaze

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 10 '24 01:04 github-actions[bot]