pyllama
Quantization with "groupsize" makes the results completely wrong.
Hi,
I'm quantizing the models following the README, but there's one common pattern whenever the groupsize parameter is used: in each case the perplexity goes through the roof and the results are completely wrong.
For example, perplexity after quantizing the 7B model to 4 bits:
wikitext2: 7.462815284729004
ptb: 11.122198104858398
c4: 8.211784362792969
And the same model with 4 bits and --groupsize 128:
wikitext2: 243848.546875
ptb: 309488.53125
c4: 240030.015625
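For scale: perplexity is exp of the average per-token cross-entropy, so even a model that guessed uniformly over LLaMA's 32,000-token vocabulary would only score exp(ln 32000) ≈ 32,000. Numbers around 240,000 are worse than random guessing, which makes me think the group scales are being mis-applied rather than merely adding noise. A tiny illustration of the math (not the eval script's code):

import torch
import torch.nn.functional as F

vocab = 32000
logits = torch.zeros(10, vocab)                      # uniform prediction over the vocab
targets = torch.randint(0, vocab, (10,))
print(torch.exp(F.cross_entropy(logits, targets)))   # ≈ 32000, the uniform-guess baseline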
And the generations for the prompt "What's the Earth?":
- 4b:
🦙: What's the Earth?
So what's the earth? It's a planet.
Which one? Well, the one that revolves around the sun.
Now that's true, but what does that mean?
- 4b, group size of 128:
🦙: What's the Earth?örtfitolly Alburd Tob fitpaunity Tobżyurd girlsurd fitattanattan�ört SE�ży girlsolly Podpois Siegunityunityollyź�éliollyört Nationpois Pod girls finalepoisazineattan
Any idea what's going on?
In case it matters, I'm using Python 3.8 on Ubuntu 22.04 running in WSL.
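For context, here is roughly what group-wise quantization does (a minimal sketch, not pyllama's actual GPTQ code; the function names and the min/max rounding are my own simplification). With --groupsize 128, every block of 128 input-channel weights gets its own scale and zero-point instead of one per row, so if the stored scales are reconstructed with the wrong group layout at load time, the dequantized weights come out wrongly scaled and the output degenerates into exactly this kind of token salad:

import torch

def quantize_groupwise(w, bits=4, groupsize=128):
    # Split each row into blocks of `groupsize` columns and quantize each
    # block with its own scale/zero-point (asymmetric min/max rounding).
    rows, cols = w.shape
    w = w.reshape(rows, cols // groupsize, groupsize)
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = ((w_max - w_min) / (2 ** bits - 1)).clamp(min=1e-8)
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, 2 ** bits - 1)
    return q, scale, zero

def dequantize_groupwise(q, scale, zero):
    # Reverse the mapping and flatten the groups back into full rows.
    return ((q - zero) * scale).reshape(q.shape[0], -1)

w = torch.randn(64, 256)
q, s, z = quantize_groupwise(w)
print((dequantize_groupwise(q, s, z) - w).abs().max())   # small per-group error

I'm not claiming that's what pyllama does internally, only that a groupsize mismatch between the quantization and inference paths would produce symptoms like these.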
Yup. I'm seeing this too. Can't figure it out.
2-bit quantization does not seem to work either (with or without the groupsize parameter).
I have the same problem!
python llama/llama_quant.py ./models/llama-7B-hf/llama-7b c4 --ckpt_dir ./models/llama-7B-hf/llama-7b --tokenizer_path ./models/llama-7B-hf/tokenizer/tokenizer.model --wbits 4 --groupsize 128 --save ./models/pyllama-7B4b.pt
wikitext2: 213490.984375
ptb: 259118.59375
c4: 207443.609375
I also see garbage after quantization. I'll try without this flag to confirm whether it works.
8-bit and 4-bit quantization without groupsize=128 work. 2-bit quantization does not and returns garbage output. groupsize=128 causes garbage output in every case as well.
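One way to narrow it down might be to inspect what actually got saved (assuming the --save file is an ordinary torch state dict; the key-name substrings below are guesses, the real names depend on pyllama's quantized-layer implementation):

import torch

ckpt = torch.load("./models/pyllama-7B4b.pt", map_location="cpu")
state = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt

for name, t in state.items():
    # Print only the quantization-related tensors.
    if torch.is_tensor(t) and any(k in name for k in ("scale", "zero", "qweight")):
        print(name, tuple(t.shape))

If the scale/zero tensors end up with one value per output row rather than one per 128-column group (or the loader assumes the opposite of what was saved), that mismatch alone could explain the garbage.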
I used groupsize 128 with 4 bits, but the results are awful:
(base) ✔ desktop:~/dev/projects/ai/pyllama [main|✔]> python quant_infer.py --wbits 4 --load ../pyllama-7B4b.pt --text "the meaning of life is" --max_length 24 --cuda cuda:0
⌛️ Loading model from ../pyllama-7B4b.pt...
✅ Model from ../pyllama-7B4b.pt is loaded successfully.
********************************************************************************
🦙: the meaning of life isurd Intży Lewnierunitypoispois Int Alburd girlslebź Intpois girlshalb
****************************** GPU/CPU/Latency Profiling ******************************
4-bits w/o groupsize worked for me as well.
Same issue