INT4-AWQ PPL results for LLaMA-2 model are not as expected
Hi, I have a question about why my INT4-AWQ PPL results for the LLaMA-2 model are so different from the paper. The paper reports a PPL of 5.60 for LLaMA-2 + INT4-AWQ, but I got 16.38. Can you confirm whether my 16.38 result is expected, and why the results differ so much?
Implementation steps:
- Perform AWQ search and save the search results (already done via the pre-computed awq_cache; see the note after the command below)
- Evaluate the AWQ quantized model on WikiText-2 (simulated pseudo quantization)
- Generate real quantized weights (INT4)
- Load and evaluate the real quantized model (at this point you can see lower GPU memory usage)
```
python -m awq.entry --model_path llama-2-7b-hf \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant/llama-2-7b-hf-w4-g128-awq.pt
```
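For completeness, step 1 (the AWQ search itself) can also be run locally instead of using the pre-computed cache; if I read the README correctly, the command is roughly as follows (the output path is just my choice):
- python -m awq.entry --model_path llama-2-7b-hf --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/llama-2-7b-w4-g128.pt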
Hi, that PPL is definitely too high. Are you using the pre-computed scales from our Hugging Face repo?
Hi @tonylins, yes, I used llm-awq/awq_cache/llama-7b-w4-g128.pt.
- git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
and the "Generate real quantized weights (INT4)" command:
- python -m awq.entry --model_path ../models/llama-2-7b-hf --w_bit 4 --q_group_size 128 --load_awq awq_cache/llama-2-7b-hf-w4-g128.pt --q_backend real --dump_quant quant/llama-2-7b-hf-w4-g128-awq.pt
Do you have any other suggestions?
I think you are using the Llama-1 cache instead of the Llama-2 cache. You should use llama-2-7b-w4-g128.pt instead.
Hi @casper-hansen, thanks for your advice. I don't quite understand why you think I'm using the Llama-1 cache; I specified llama-2-7b-w4-g128.pt on the command line, and I haven't changed the code. Do I need to configure the cache somewhere in the code?
Hi @xianwujie , thank you for raising this issue!
The PPL results in our paper were obtained with the GPTQ evaluation code, but our current codebase uses lm-eval-harness for evaluation, which can lead to some PPL differences. To reproduce the results in the paper, you can try using the GPTQ evaluation code.
But I think a PPL of 16.38 under the lm-eval-harness evaluation still seems too high. I would suggest first checking whether your fake-quantization PPL is consistent with the real quantized model's PPL. If there is a large gap, it could indicate a problem in the real quantization process.
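For reference, the GPTQ-style evaluation is roughly the following sketch (assuming the transformers and datasets packages; the model path is a placeholder, and this is not our exact evaluation code): concatenate the WikiText-2 test split and score non-overlapping 2048-token windows.
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "llama-2-7b-hf"  # placeholder: your local checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Concatenate the whole test split, then split into 2048-token chunks
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

seqlen = 2048
n_chunks = ids.numel() // seqlen
nlls = []
with torch.no_grad():
    for i in range(n_chunks):
        chunk = ids[:, i * seqlen : (i + 1) * seqlen].to(model.device)
        # labels=chunk makes HF compute the shifted cross-entropy for us
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (n_chunks * seqlen))
print(f"WikiText-2 PPL: {ppl.item():.3f}")
```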
Hi @Sakits, thanks for your reply. I verified the fake-quantization PPL with awq_cache/llama-2-7b-w4-g128.pt and got a PPL of 16.387.
- python -m awq.entry --model_path ../models/llama-2-7b --tasks wikitext --w_bit 4 --q_group_size 128 --load_awq awq_cache/llama-2-7b-w4-g128.pt --q_backend fake
It looks like a problem with the PPL implementation; I will verify with the GPTQ evaluation code. Thank you again.
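(For context, my understanding of the pseudo quantization that --q_backend fake simulates is roughly the following sketch, not the repo's exact code: weights are rounded to the INT4 grid per group of 128 and dequantized back to FP16, so memory does not shrink but the precision loss matches real INT4.)
```python
import torch

def pseudo_quantize(w: torch.Tensor, n_bit: int = 4, group_size: int = 128):
    # Illustration only: assumes w.numel() is divisible by group_size
    assert w.numel() % group_size == 0
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    # per-group asymmetric scale / zero-point
    w_max = w.amax(dim=1, keepdim=True)
    w_min = w.amin(dim=1, keepdim=True)
    qmax = 2**n_bit - 1
    scale = (w_max - w_min).clamp(min=1e-5) / qmax
    zero = (-w_min / scale).round()
    # quantize to the integer grid, then dequantize back to float
    w_q = (torch.clamp((w / scale).round() + zero, 0, qmax) - zero) * scale
    return w_q.reshape(orig_shape)
```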
Hi @Sakits, I just verified the PPL with llama-2-7b-hf and the benchmark in GPTQ-for-LLaMa:
```
# FP16
python llama.py ../llama-2-7b-hf/ wikitext2 --benchmark 2048 --check

# INT4 g128
python llama.py ../llama-2-7b-hf/ wikitext2 --wbits 4 --true-sequential --groupsize 128 --save ./llama-7b-4bit-gs128_test.pt
python llama.py ../llama-2-7b-hf/ wikitext2 --wbits 4 --true-sequential --groupsize 128 --load ./llama-7b-4bit-gs128_test.pt --benchmark 2048 --check
```
But I got 6.78 and 7.03 for FP16 and INT4-g128 respectively, which are still higher than the 5.47 (Llama-2-7B FP16) and 5.69 (Llama-2-7B GPTQ) reported in Table 4 of the paper.
Am I using the correct model and GPTQ evaluation code?
Hi @noob-cod, I noticed the perplexity you reported for the FP16 model is quite high at 6.78, so it's possible you are using the wrong evaluation code. Please double-check that you evaluated on the WikiText dataset. I would suggest using the code from https://github.com/mit-han-lab/llm-awq/pull/111 for evaluation.
Hi @Sakits, thanks for your reply. With that code I now get the same perplexity (5.47) as the FP16 model in the paper.