INT4-AWQ PPL results for LLaMA-2 model are not as expected
Hi, I have a question about why my INT4-AWQ PPL results for the LLaMA-2 model are so different from the paper. The paper reports a PPL of 5.60 for LLaMA-2 + INT4-AWQ, but I got 16.38. Can you confirm whether my 16.38 result is expected, and why the results differ so much?
Implementation steps:
- Perform AWQ search and save the search results (already done via the pre-computed awq_cache; see the note after the command below)
- Evaluate the AWQ quantized model on WikiText-2 (simulated pseudo quantization)
- Generate real quantized weights (INT4)
- Load and evaluate the real quantized model (at this point you can see lower GPU memory usage)
```
python -m awq.entry --model_path llama-2-7b-hf \
    --tasks wikitext \
    --w_bit 4 --q_group_size 128 \
    --load_quant quant/llama-2-7b-hf-w4-g128-awq.pt
```
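For completeness, step 1 (the AWQ search itself) can also be run locally instead of using the pre-computed cache; if I read the README correctly, the command is roughly as follows (the output path is just my choice):
- python -m awq.entry --model_path llama-2-7b-hf --w_bit 4 --q_group_size 128 --run_awq --dump_awq awq_cache/llama-2-7b-w4-g128.pt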
Hi, that PPL is definitely too high. Are you using the pre-computed scales from our Hugging Face repo?
Hi @tonylins, yes, I used llm-awq/awq_cache/llama-7b-w4-g128.pt.
- git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
and the "Generate real quantized weights (INT4)" command:
- python -m awq.entry --model_path ../models/llama-2-7b-hf --w_bit 4 --q_group_size 128 --load_awq awq_cache/llama-2-7b-hf-w4-g128.pt --q_backend real --dump_quant quant/llama-2-7b-hf-w4-g128-awq.pt
Do you have any other suggestions?
I think you are using the Llama-1 cache instead of the Llama-2 cache. You should use llama-2-7b-w4-g128.pt instead.
Hi @casper-hansen, thanks for your advice. I don't quite understand why you think I'm using the Llama-1 cache; I specified llama-2-7b-w4-g128.pt on the command line, and I haven't changed the code. Do I need to configure the cache somewhere in the code?
Hi @xianwujie , thank you for raising this issue!
The PPL results in our paper were obtained with the GPTQ evaluation code, but our current codebase uses lm-eval-harness for evaluation, which can lead to some PPL differences. To reproduce the results in the paper, you can try using the GPTQ evaluation code.
But I think a PPL of 16.38 under the lm-eval-harness evaluation still seems too high. I would suggest first checking whether your fake-quantization PPL is consistent with the real quantized model's PPL. If there is a large gap, it could indicate a problem in the real quantization process.
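For reference, the GPTQ-style evaluation is roughly the following sketch (assuming the transformers and datasets packages; the model path is a placeholder, and this is not our exact evaluation code): concatenate the WikiText-2 test split and score non-overlapping 2048-token windows.
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "llama-2-7b-hf"  # placeholder: your local checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Concatenate the whole test split, then split into 2048-token chunks
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

seqlen = 2048
n_chunks = ids.numel() // seqlen
nlls = []
with torch.no_grad():
    for i in range(n_chunks):
        chunk = ids[:, i * seqlen : (i + 1) * seqlen].to(model.device)
        # labels=chunk makes HF compute the shifted cross-entropy for us
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (n_chunks * seqlen))
print(f"WikiText-2 PPL: {ppl.item():.3f}")
```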
Hi @Sakits, thanks for your reply. I verified the fake-quantization PPL with awq_cache/llama-2-7b-w4-g128.pt and got a PPL of 16.387.
- python -m awq.entry --model_path ../models/llama-2-7b --tasks wikitext --w_bit 4 --q_group_size 128 --load_awq awq_cache/llama-2-7b-w4-g128.pt --q_backend fake
It looks like a problem with the PPL implementation; I will verify with the GPTQ evaluation code. Thank you again.
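(For context, my understanding of the pseudo quantization that --q_backend fake simulates is roughly the following sketch, not the repo's exact code: weights are rounded to the INT4 grid per group of 128 and dequantized back to FP16, so memory does not shrink but the precision loss matches real INT4.)
```python
import torch

def pseudo_quantize(w: torch.Tensor, n_bit: int = 4, group_size: int = 128):
    # Illustration only: assumes w.numel() is divisible by group_size
    assert w.numel() % group_size == 0
    orig_shape = w.shape
    w = w.reshape(-1, group_size)
    # per-group asymmetric scale / zero-point
    w_max = w.amax(dim=1, keepdim=True)
    w_min = w.amin(dim=1, keepdim=True)
    qmax = 2**n_bit - 1
    scale = (w_max - w_min).clamp(min=1e-5) / qmax
    zero = (-w_min / scale).round()
    # quantize to the integer grid, then dequantize back to float
    w_q = (torch.clamp((w / scale).round() + zero, 0, qmax) - zero) * scale
    return w_q.reshape(orig_shape)
```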
Hi @Sakits, I just verified the PPL with llama-2-7b-hf and the benchmark in GPTQ-for-LLaMa:
```
# FP16
python llama.py ../llama-2-7b-hf/ wikitext2 --benchmark 2048 --check

# INT4 g128
python llama.py ../llama-2-7b-hf/ wikitext2 --wbits 4 --true-sequential --groupsize 128 --save ./llama-7b-4bit-gs128_test.pt
python llama.py ../llama-2-7b-hf/ wikitext2 --wbits 4 --true-sequential --groupsize 128 --load ./llama-7b-4bit-gs128_test.pt --benchmark 2048 --check
```
But I got 6.78 and 7.03 for FP16 and INT4-g128 respectively, which are still higher than the 5.47 (Llama-2-7B FP16) and 5.69 (Llama-2-7B GPTQ) reported in Table 4 of the paper.
Am I using the correct model and GPTQ evaluation code?
Hi @noob-cod, I noticed the perplexity you reported for the FP16 model is quite high at 6.78, so it's possible you are using the wrong evaluation code. Please double-check that you evaluated on the WikiText dataset. I would suggest using the code from https://github.com/mit-han-lab/llm-awq/pull/111 for evaluation.
Hi @Sakits, thanks for your reply. With that code I now get the same perplexity (5.47) as the FP16 model in the paper.