Bad results for WinoGrande - more testers wanted
Facebook published expected results for the WinoGrande test: a score of 70 for the 7B model.
I wrote a small script (see #40) that fetches the dataset via the Hugging Face `datasets` library and runs the tests.
Because the prompt and parameters were not published (see https://github.com/facebookresearch/llama/issues/188), I wrote a prompt myself. It is probably not very good, but it was the only version that worked at all.
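For reference, here is a minimal sketch of this kind of evaluation. The fill-in-the-blank likelihood scoring is a common way to evaluate WinoGrande, not necessarily what Facebook used, and the checkpoint path is a placeholder:

```python
# A minimal sketch of a likelihood-based WinoGrande evaluation.
# Assumptions: a Hugging Face transformers checkpoint (the name below is a
# placeholder), and fill-in-the-blank scoring; the official prompt is unknown.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "path/to/llama-7b-hf"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).cuda()
model.eval()

def total_nll(text: str) -> float:
    """Total negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
    with torch.no_grad():
        # labels=ids returns the mean cross-entropy over the predicted tokens
        loss = model(ids, labels=ids).loss
    return loss.item() * (ids.shape[1] - 1)  # mean -> sum over predicted tokens

ds = load_dataset("winogrande", "winogrande_xl", split="validation")
correct = 0
for ex in ds:
    # Each sentence contains a "_" blank; fill it with each option and
    # pick the completion the model finds more likely (lower NLL).
    cands = [ex["sentence"].replace("_", ex[f"option{i}"]) for i in (1, 2)]
    pred = "1" if total_nll(cands[0]) < total_nll(cands[1]) else "2"
    correct += pred == ex["answer"]

print(f"accuracy: {correct / len(ds):.1%}")
```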
The problem: with the 4-bit 7B model I only get about 48%, which means the model performs no better than random (WinoGrande is a two-choice task, so chance is 50%).
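A quick binomial check confirms that 48% is statistically indistinguishable from guessing; the split size of 1267 is an assumption about the exact validation set used:

```python
# Quick sanity check: is 48% on a two-choice task distinguishable from
# coin-flipping? (1267 examples is an assumption about the exact split size.)
from scipy.stats import binomtest

n = 1267                      # evaluated examples (assumed)
k = round(0.48 * n)           # ~48% correct
print(binomtest(k, n, p=0.5).pvalue)  # ~0.15, well above 0.05 -> consistent with chance
```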
So something is off. One or more of the following:
- wrong parameters
- a bad prompt
- something else wrong in my script
- quantization hurts model performance
- a bug in the quantization or inference implementation
As I am new to this topic, it could very well be a problem on my end.
So I would like help fixing the prompt/script, and I would also like to see results for other models:
- other model versions
- other quantizations
OK, an update:
It looks like the 7B 16-bit model, the 13B 4-bit, and the 30B 4-bit all have the same issue. I think we need a different prompt or different parameters.