
Bad results for WinoGrande - more testers wanted

Open · DanielWe2 opened this issue 1 year ago · 1 comment

Facebook published expected results for the WinoGrande test: a score of 70 for the 7B model.

I wrote a small script (see #40) that fetches the dataset via the `datasets` library and runs the test.
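
For reference, here is a minimal sketch of the fetching part, assuming the Hugging Face `datasets` library and the field names of the `winogrande` dataset on the Hub (the actual script in #40 may differ):

```python
from datasets import load_dataset

# The "winogrande_xl" config selects the largest training set;
# the validation split is the same across configs.
ds = load_dataset("winogrande", "winogrande_xl", split="validation")

for ex in ds:
    # Each example has a sentence containing a "_" blank, two
    # candidate fillers, and the gold answer ("1" or "2").
    print(ex["sentence"], ex["option1"], ex["option2"], ex["answer"])
    break
```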

Because the prompt and parameters were not published (see https://github.com/facebookresearch/llama/issues/188), I wrote a prompt myself. It is probably not very good, but it was the only version that worked at all.
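
To illustrate, a prompt for one WinoGrande item could look like the sketch below. This is a hypothetical format for illustration only, not necessarily the one used in #40:

```python
def build_prompt(ex):
    # Hypothetical zero-shot prompt; the model is then expected to
    # answer "1" or "2". The prompt in #40 may be structured differently.
    return (
        "Fill in the blank with the option that makes more sense.\n"
        f"Sentence: {ex['sentence']}\n"
        f"Option 1: {ex['option1']}\n"
        f"Option 2: {ex['option2']}\n"
        "Answer: Option"
    )
```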

The problem: with the 4-bit 7B model I only get about 48%. Since WinoGrande is a binary choice task, that means the model is doing no better than random.

So something is off. One or more of:

  • Wrong parameters
  • Bad prompt
  • Something else wrong in my script
  • Quantization hurting model performance
  • A bug in the quantization or inference implementation

As I am new to this topic, it could very well be a problem on my end.

So I would like help fixing the prompt/script (one alternative scoring approach is sketched after this list) and would also like to see results for other models:

  • Other model versions
  • Other quantizations
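
One approach that might be worth trying instead of free-form generation is to fill the blank with each option and pick the one the model assigns the higher log-likelihood, which is roughly what common eval harnesses do for WinoGrande. A minimal sketch with plain Hugging Face `transformers`, assuming a causal LM and tokenizer are already loaded (loading a GPTQ-quantized checkpoint in this repo works differently):

```python
import torch

def option_logprob(model, tokenizer, sentence, option):
    # Fill the blank, then score the whole sentence by the sum
    # of per-token log-probabilities under the model.
    text = sentence.replace("_", option)
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits
    # Shift by one: logits at position t predict token t+1.
    logprobs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    picked = logprobs.gather(2, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return picked.sum().item()

def predict(model, tokenizer, ex):
    # Returns "1" or "2", directly comparable to ex["answer"].
    s1 = option_logprob(model, tokenizer, ex["sentence"], ex["option1"])
    s2 = option_logprob(model, tokenizer, ex["sentence"], ex["option2"])
    return "1" if s1 >= s2 else "2"
```

With this kind of scoring no sampling parameters are involved at all, which would take two of the suspects above (wrong parameters and a bad prompt) out of the equation.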

DanielWe2 avatar Mar 13 '23 20:03 DanielWe2

OK, an update:

It looks like the 7B 16-bit model, the 13B 4-bit, and the 30B 4-bit all have the same issue. I think we need a different prompt or different parameters.

DanielWe2 avatar Mar 13 '23 23:03 DanielWe2