pyllama
Quantization with "groupsize" makes the results completely wrong.
Hi,
I'm quantizing the models following the README, but there's one common pattern whenever the groupsize parameter is used: in each case the perplexity goes through the roof and the results are completely wrong.
For example, perplexity after quantizing the 7B model to 4 bits:
wikitext2: 7.462815284729004
ptb: 11.122198104858398
c4: 8.211784362792969
And the same model with 4 bits and --groupsize 128:
wikitext2: 243848.546875
ptb: 309488.53125
c4: 240030.015625
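For scale: perplexity is exp of the average per-token cross-entropy, so even a model that guessed uniformly over LLaMA's 32,000-token vocabulary would only score exp(ln 32000) ≈ 32,000. Numbers around 240,000 are worse than random guessing, which makes me think the group scales are being mis-applied rather than merely adding noise. A tiny illustration of the math (not the eval script's code):

import torch
import torch.nn.functional as F

vocab = 32000
logits = torch.zeros(10, vocab)                      # uniform prediction over the vocab
targets = torch.randint(0, vocab, (10,))
print(torch.exp(F.cross_entropy(logits, targets)))   # ≈ 32000, the uniform-guess baseline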
And the generations for the prompt "What's the Earth?":
- 4b:
🦙: What's the Earth?
So what's the earth? It's a planet.
Which one? Well, the one that revolves around the sun.
Now that's true, but what does that mean?
- 4b, group size of 128:
🦙: What's the Earth?örtfitolly Alburd Tob fitpaunity Tobżyurd girlsurd fitattanattan�ört SE�ży girlsolly Podpois Siegunityunityollyź�éliollyört Nationpois Pod girls finalepoisazineattan
Any idea what's going on?
In case it matters, I'm using Python 3.8 on Ubuntu 22.04 running in WSL.
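For context, here is roughly what group-wise quantization does (a minimal sketch, not pyllama's actual GPTQ code; the function names and the min/max rounding are my own simplification). With --groupsize 128, every block of 128 input-channel weights gets its own scale and zero-point instead of one per row, so if the stored scales are reconstructed with the wrong group layout at load time, the dequantized weights come out wrongly scaled and the output degenerates into exactly this kind of token salad:

import torch

def quantize_groupwise(w, bits=4, groupsize=128):
    # Split each row into blocks of `groupsize` columns and quantize each
    # block with its own scale/zero-point (asymmetric min/max rounding).
    rows, cols = w.shape
    w = w.reshape(rows, cols // groupsize, groupsize)
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = ((w_max - w_min) / (2 ** bits - 1)).clamp(min=1e-8)
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, 2 ** bits - 1)
    return q, scale, zero

def dequantize_groupwise(q, scale, zero):
    # Reverse the mapping and flatten the groups back into full rows.
    return ((q - zero) * scale).reshape(q.shape[0], -1)

w = torch.randn(64, 256)
q, s, z = quantize_groupwise(w)
print((dequantize_groupwise(q, s, z) - w).abs().max())   # small per-group error

I'm not claiming that's what pyllama does internally, only that a groupsize mismatch between the quantization and inference paths would produce symptoms like these.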
Yup. I'm seeing this too. Can't figure it out.
2-bit quantization does not seem to work either (with or without the groupsize parameter).
I have the same problem!
python llama/llama_quant.py ./models/llama-7B-hf/llama-7b c4 --ckpt_dir ./models/llama-7B-hf/llama-7b --tokenizer_path ./models/llama-7B-hf/tokenizer/tokenizer.model --wbits 4 --groupsize 128 --save ./models/pyllama-7B4b.pt
wikitext2: 213490.984375
ptb: 259118.59375
c4: 207443.609375
I also see garbage after quantization. I'll try without this flag to confirm whether it works.
8-bit and 4-bit quantization without groupsize=128 work. 2-bit quantization does not and returns garbage output. groupsize=128 causes garbage output in every case as well.
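One way to narrow it down might be to inspect what actually got saved (assuming the --save file is an ordinary torch state dict; the key-name substrings below are guesses, the real names depend on pyllama's quantized-layer implementation):

import torch

ckpt = torch.load("./models/pyllama-7B4b.pt", map_location="cpu")
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

for name, t in state.items():
    # Print only the quantization-related tensors.
    if torch.is_tensor(t) and any(k in name for k in ("scale", "zero", "qweight")):
        print(name, tuple(t.shape))

If the scale/zero tensors end up with one value per output row rather than one per 128-column group (or the loader assumes the opposite of what was saved), that mismatch alone could explain the garbage.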
I used groupsize 128 with 4 bits, but the results are awful:
(base) ✔ desktop:~/dev/projects/ai/pyllama [main|✔]> python quant_infer.py --wbits 4 --load ../pyllama-7B4b.pt --text "the meaning of life is" --max_length 24 --cuda cuda:0
⌛️ Loading model from ../pyllama-7B4b.pt...
✅ Model from ../pyllama-7B4b.pt is loaded successfully.
********************************************************************************
🦙: the meaning of life isurd Intży Lewnierunitypoispois Int Alburd girlslebź Intpois girlshalb
****************************** GPU/CPU/Latency Profiling ******************************
4-bits w/o groupsize worked for me as well.
Same issue