GPTQ-for-LLaMa
T5 Benchmark
Thank you for the repo.
I am curious what benchmark results (MMLU and BBH) we should expect for the gptq-flan-t5 models. I am getting an average accuracy of 25.2% on MMLU using the xl version (4bit, 128 groupsize), which seems a bit far off from the results posted on flan-eval. Am I missing something?
Quantization doesn't seem to work well for unknown reasons. I tried many things to solve this, but it still doesn't work. Because of this, I am not merging this for now.
Ah. Got it~ Thank you. Hope someone can provide insights on this.
Yeah... I was noticing the same issue; compared to int8 quantization, the int4 performance is not that good yet...
I'm comparing flan-t5-xxl int8 quant using this script (will fit on a 24G card): https://github.com/johnrobinsn/flan_ul2/blob/main/infer-flan-ul2-int8-basemodel.py
against flan-t5-xxl int4 using this script: https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/t5/t5_inference.py
and yeah... pretty bad results with int4. In the morning I'll pull the benchmark into my int8 script so I can get some numbers to compare, and try to look deeper at the quantization code.
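The int8 side is just the standard bitsandbytes load path in transformers; a rough sketch (details may differ from the linked script):

```python
# Sketch of the int8 baseline (assumes transformers + bitsandbytes + accelerate
# are installed; details may differ from infer-flan-ul2-int8-basemodel.py).
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,  # bitsandbytes int8 quantization; fits on a 24G card
)

prompt = "Answer the following question. What is the boiling point of water?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```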
@jasontian6666 by flan-eval you're talking about this..? thx
https://github.com/declare-lab/flan-eval
Yes, this is the repo I'm referring to. Are their results based on int8?
they have a --load_8bit flag that I'm trying now with flan-t5-xxl
@jasontian6666 can you share your code for everyone to check?
I just used the exact code the author provided in the readme for conversion and benchmarking (simply replacing the small model with xl).
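Roughly along these lines (the --save filename here is just an example, and I'm going from memory on the exact flags):

```bash
# quantize, per the t5 branch readme, with xl swapped in for small
# (the --save filename is just an example)
python t5.py google/flan-t5-xl wikitext2 --wbits 4 --groupsize 128 --save flan-t5-xl-4bit-128g.pt
```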
@jasontian6666 do you still keep the llama code in the source code to run for T5 model?
Are we talking about the code in the t5 branch?
@jasontian6666 oki, let's check together
Just collecting some details on the performance of int8 quant vs int4 quant for t5 models (not llama).
Using flan-eval to eval int8 quant performance:
- int8; flan-t5-xxl; mmlu => Average accuracy: 0.544
- int8; flan-t5-xxl; bbh => average: 0.4379700192171589
Using the t5 branch with python t5.py google/flan-t5-xxl wikitext2 --wbits 4 --act-order --groupsize 128:
- int4; flan-t5-xxl; mmlu => Average accuracy: 0.237
- int4; flan-t5-xxl; bbh => Average accuracy: 0.203
pretty big loss in performance as it stands.
Following - I'm also seeing very degraded performance on a flan-based model that has downstream finetuning. Have you all tried experimenting with the nsamples or percdamp parameters?
Anecdotally - I notice the error prints during quantization in 4 bit are 3-4 orders of magnitude higher than in 8 bit, particularly in the dense layers.
The flan models are already finetuned... so even before any additional finetuning, I'm seeing the degradation captured above.
@bradfox2 But that's a good thought; increasing the number of calibration samples, etc., might improve the quality of the quantization quite a bit. Definitely worth some experiments.
@qwopqwop200 did you try tweaking those parameters and see any improvement?
I'm somewhat arbitrarily trying the following (bumping up the percdamp and nsamples defaults):
CUDA_VISIBLE_DEVICES=1 python t5.py google/flan-t5-xxl wikitext2 --wbits 4 --act-order --groupsize 128 --percdamp 0.1 --nsamples 256 --save t5-xxl-4bit-128g-ns256-da10.pt
Using these params got a negligible lift:
- int4; flan-t5-xxl; mmlu => Average accuracy: 0.276
For the MMLU benchmark, I have:
Flan-t5-large:
- float16: 41.9%
- 8bit: 36.8%
- 4bit: 35.1%
Flan-t5-xl:
- float16: 49.3%
- 8bit: 25.4%
- 4bit: 25.2%
Not much loss of performance from 8bit to 4bit, but quantization itself seems to hurt.
I'm getting 0.294 for FLAN-T5-XXL with --act-order --groupsize 128 --percdamp 0.05 --nsamples 512. I'd recommend trying higher nsamples, per the GPTQ authors' recommendation for FLAN-T5-XXL.
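The full invocation was something like this (the --save filename is just a placeholder):

```bash
# same t5.py entry point as above; the --save path is just a placeholder
python t5.py google/flan-t5-xxl wikitext2 --wbits 4 --act-order --groupsize 128 --percdamp 0.05 --nsamples 512 --save flan-t5-xxl-4bit-128g-ns512.pt
```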
@jasontian6666 @bradfox2 watching this https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/189
might help t5
Thank you. Will keep an eye on it. Hope it can help.
@johnrobinsn thanks. Are you planning on testing again with the merged code?
Quant error analysis on LLaMa-7B, hope it helps. @bradfox2 cc @jasontian6666 @qwopqwop200
More tests are still on the way QvQ
@bradfox2 Yeah... I would like to, although it might take me a week or so to circle back to it. It will need to be merged into the t5 branch...
Flan-T5 keeps some weights in FP32 for the larger variants (XL and XXL). The Transformers lib has a patch for this: https://github.com/huggingface/transformers/commit/b9b70b0e66694cec4a1f4429335f335592688189
This could be related.
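A quick way to see which modules that patch keeps in fp32 (a sketch; assumes a transformers version that already includes the commit above):

```python
# Sketch: inspect which T5 modules Transformers forces to stay in fp32
# (assumes a transformers version that includes the _keep_in_fp32_modules patch).
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xl", torch_dtype=torch.float16
)

# For T5 this should be ["wo"], the feed-forward output projection.
fp32_modules = getattr(model, "_keep_in_fp32_modules", None) or []
print("kept in fp32:", fp32_modules)

# A quantization pass over the linear layers could skip these modules:
skip = [name for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear)
        and any(k in name for k in fp32_modules)]
print(f"{len(skip)} linear layers would be left unquantized, e.g. {skip[:2]}")
```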
Also, has anyone here tried to quantize flan-t5-large and compare it to the base variant? Since only the larger models seem to be affected by this bug, it would help to know whether the current T5 implementation in this project is correct.
@baptistejamin Have you had any success with any project running xxl in a quantized form? We seem to be running in the same threads.
So far only this project works successfully: https://bellard.org/ts_server/ts_server.html