GPTQ-for-LLaMa
T5 Benchmark
Thank you for the repo.
I am curious what benchmark results (MMLU and BBH) we should expect for the gptq-flan-t5 models. I am getting an average accuracy of 25.2% on MMLU using the xl version (4bit, 128 groupsize), which seems a bit far off from the results posted on flan-eval. Am I missing something?
Quantization doesn't seem to work well for unknown reasons. I tried many things to solve this, but it still doesn't work. Because of this, I am not merging this for now.
Ah. Got it~ Thank you. Hope someone can provide insights on this.
Yeah... I was noticing the same issue; compared to int8 quantization, the int4 performance is not that good yet...
I'm comparing flan-t5-xxl int8 quant using this script (will fit on a 24G card): https://github.com/johnrobinsn/flan_ul2/blob/main/infer-flan-ul2-int8-basemodel.py
against flan-t5-xxl int4 using this script: https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/t5/t5_inference.py
and yeah... pretty bad results with int4. In the morning I'll pull the benchmark into my int8 script so I can get some numbers to compare, and try to look deeper at the quantization code.
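The int8 side is just the standard bitsandbytes load path in transformers; a rough sketch (details may differ from the linked script):

```python
# Sketch of the int8 baseline (assumes transformers + bitsandbytes + accelerate
# are installed; details may differ from infer-flan-ul2-int8-basemodel.py).
from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,  # bitsandbytes int8 quantization; fits on a 24G card
)

prompt = "Answer the following question. What is the boiling point of water?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```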
@jasontian6666 by flan-eval you're talking about this..? thx
https://github.com/declare-lab/flan-eval
Yes, this is the repo I'm referring to. Are their results based on int8?
they have a --load_8bit flag that I'm trying now with flan-t5-xxl
@jasontian6666 can you share your code for everyone to check?
I just used the exact code the author provided in the readme for conversion and benchmarking (simply replacing the small model with xl).
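Roughly along these lines (the --save filename here is just an example, and I'm going from memory on the exact flags):

```bash
# quantize, per the t5 branch readme, with xl swapped in for small
# (the --save filename is just an example)
python t5.py google/flan-t5-xl wikitext2 --wbits 4 --groupsize 128 --save flan-t5-xl-4bit-128g.pt
```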
@jasontian6666 do you still keep the llama code in the source code to run for T5 model?
Are we talking about the code in the t5 branch?
@jasontian6666 oki, let's check together
Just collecting some details on the performance of int8 quant vs int4 quant for t5 models (not llama).
Using flan-eval to eval int8 quant performance:
- int8; flan-t5-xxl; mmlu => Average accuracy: 0.544
- int8; flan-t5-xxl; bbh => average: 0.4379700192171589
Using the t5 branch with python t5.py google/flan-t5-xxl wikitext2 --wbits 4 --act-order --groupsize 128:
- int4; flan-t5-xxl; mmlu => Average accuracy: 0.237
- int4; flan-t5-xxl; bbh => Average accuracy: 0.203
pretty big loss in performance as it stands.
Following - I'm also seeing very degraded performance on a flan-based model that has downstream finetuning. Have you all tried experimenting with the nsamples or percdamp parameters?
Anecdotally - I notice the error prints during quantization in 4 bit are 3-4 orders of magnitude higher than in 8 bit, particularly in the dense layers.
The flan models are already finetuned... so even before any additional finetuning, I'm seeing the degradation captured above.
@bradfox2 But that's a good thought; increasing the number of calibration samples, etc., might improve the quality of the quantization quite a bit. Definitely worth some experiments.
@qwopqwop200 did you try tweaking those parameters and see any improvement?
I'm somewhat arbitrarily trying the following (bumping up the percdamp and nsamples defaults):
CUDA_VISIBLE_DEVICES=1 python t5.py google/flan-t5-xxl wikitext2 --wbits 4 --act-order --groupsize 128 --percdamp 0.1 --nsamples 256 --save t5-xxl-4bit-128g-ns256-da10.pt
Using these params got a negligible lift:
- int4; flan-t5-xxl; mmlu => Average accuracy: 0.276
For the MMLU benchmark, I have:
Flan-t5-large:
- float16: 41.9%
- 8bit: 36.8%
- 4bit: 35.1%
Flan-t5-xl:
- float16: 49.3%
- 8bit: 25.4%
- 4bit: 25.2%
Not much loss of performance from 8bit to 4bit, but quantization itself seems to hurt.
I'm getting 0.294 for FLAN-T5-XXL with --act-order --groupsize 128 --percdamp 0.05 --nsamples 512. I'd recommend trying higher nsamples, per the GPTQ authors' recommendation for FLAN-T5-XXL.
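The full invocation was something like this (the --save filename is just a placeholder):

```bash
# same t5.py entry point as above; the --save path is just a placeholder
python t5.py google/flan-t5-xxl wikitext2 --wbits 4 --act-order --groupsize 128 --percdamp 0.05 --nsamples 512 --save flan-t5-xxl-4bit-128g-ns512.pt
```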
@jasontian6666 @bradfox2 watching this https://github.com/qwopqwop200/GPTQ-for-LLaMa/pull/189
might help t5
Thank you. Will keep an eye on it. Hope it can help.
@johnrobinsn thanks. Are you planning on testing again with the merged code?
Quant error analysis on LLaMa-7B, hope it helps. @bradfox2 cc @jasontian6666 @qwopqwop200
More tests are still on the way QvQ
@bradfox2 Yeah... I would like to, although it might take me a week or so to circle back to it. It will need to be merged into the t5 branch...
Flan-T5 keeps some weights in FP32 for the larger variants (XL and XXL). The Transformers lib has a patch for this: https://github.com/huggingface/transformers/commit/b9b70b0e66694cec4a1f4429335f335592688189
This could be related.
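A quick way to see which modules that patch keeps in fp32 (a sketch; assumes a transformers version that already includes the commit above):

```python
# Sketch: inspect which T5 modules Transformers forces to stay in fp32
# (assumes a transformers version that includes the _keep_in_fp32_modules patch).
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xl", torch_dtype=torch.float16
)

# For T5 this should be ["wo"], the feed-forward output projection.
fp32_modules = getattr(model, "_keep_in_fp32_modules", None) or []
print("kept in fp32:", fp32_modules)

# A quantization pass over the linear layers could skip these modules:
skip = [name for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear)
        and any(k in name for k in fp32_modules)]
print(f"{len(skip)} linear layers would be left unquantized, e.g. {skip[:2]}")
```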
Also, has anyone here tried to quantize flan-t5-large and compare it to the base variant? Since only the larger models seem to be affected by this bug, it would help to know whether the current T5 implementation in this project is correct.
@baptistejamin Have you had any success with any project running xxl in a quantized form? We seem to be running in the same threads.
So far only this project works successfully: https://bellard.org/ts_server/ts_server.html