Bad results for WinoGrande - more testers wanted
Facebook published expected results for the WinoGrande test: a score of 70 for the 7B model.
I wrote a small script (see #40) that fetches the dataset via the Hugging Face `datasets` library and runs the tests.
Because the prompt and parameters were not published (see https://github.com/facebookresearch/llama/issues/188), I wrote a prompt myself. It is probably not very good, but it was the only version that worked at all.
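For reference, here is a minimal sketch of this kind of evaluation. The fill-in-the-blank likelihood scoring is a common way to evaluate WinoGrande, not necessarily what Facebook used, and the checkpoint path is a placeholder:

```python
# A minimal sketch of a likelihood-based WinoGrande evaluation.
# Assumptions: a Hugging Face transformers checkpoint (the name below is a
# placeholder), and fill-in-the-blank scoring; the official prompt is unknown.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "path/to/llama-7b-hf"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()
model.eval()

def total_nll(text: str) -> float:
    """Total negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        # labels=ids returns the mean cross-entropy over the predicted tokens
        loss = model(ids, labels=ids).loss
    return loss.item() * (ids.shape[1] - 1)  # mean -> sum over predicted tokens

ds = load_dataset("winogrande", "winogrande_xl", split="validation")
correct = 0
for ex in ds:
    # Each sentence contains a "_" blank; fill it with each option and
    # pick the completion the model finds more likely (lower NLL).
    cands = [ex["sentence"].replace("_", ex[f"option{i}"]) for i in (1, 2)]
    pred = "1" if total_nll(cands[0]) < total_nll(cands[1]) else "2"
    correct += pred == ex["answer"]

print(f"accuracy: {correct / len(ds):.1%}")
```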
The problem: with the 4-bit 7B model I only get about 48%, which means the model performs no better than random (WinoGrande is a two-choice task, so chance is 50%).
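A quick binomial check confirms that 48% is statistically indistinguishable from guessing; the split size of 1267 is an assumption about the exact validation set used:

```python
# Quick sanity check: is 48% on a two-choice task distinguishable from
# coin-flipping? (1267 examples is an assumption about the exact split size.)
from scipy.stats import binomtest

n = 1267                      # evaluated examples (assumed)
k = round(0.48 * n)           # ~48% correct
print(binomtest(k, n, p=0.5).pvalue)  # ~0.15, well above 0.05 -> consistent with chance
```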
So something is off. One or more of the following:
- wrong parameters
- a bad prompt
- something else wrong in my script
- quantization hurts model performance
- a bug in the quantization or inference implementation
As I am new to this topic, it could very well be a problem on my end.
So I would like help fixing the prompt/script, and I would also like to see results for other models:
- other model versions
- other quantizations
OK, an update:
It looks like the 7B 16-bit model, the 13B 4-bit, and the 30B 4-bit all have the same issue. I think we need a different prompt or different parameters.