lm-evaluation-harness
Winogrande Performance Discrepancy
Could this PR have a significant impact on the Winogrande metric? I am seeing a discrepancy between v0.4.0 and the version that HuggingFace uses for the Open LLM Leaderboard (commit b281b09).
For two Llama2 models (one quantized, one unquantized), the Winogrande metric on v0.4.0 is significantly lower than what the older version of the harness produced, as shown below:
| Model | Harness Version | Winogrande |
|---|---|---|
| meta-llama/llama-2-7b-chat-hf | v0.4.0 | **66.38** |
| meta-llama/llama-2-7b-chat-hf | b281b09 | 73.01 |
| TheBloke/Llama-2-7B-Chat-GPTQ | v0.4.0 | **65.43** |
| TheBloke/Llama-2-7B-Chat-GPTQ | b281b09 | 70.80 |
Originally posted by @JeevanBhoot in https://github.com/EleutherAI/lm-evaluation-harness/issues/627#issuecomment-1879286269
@JeevanBhoot Could you share which branch produced 73.01? I'm testing both `main` and `master`, and they both produce 0.6646.

Command for `main`:

```
lm-eval --model_args="pretrained=meta-llama/Llama-2-7b-chat-hf" --model hf --task=winogrande
```

Command for `master`:

```
python main.py --model_args="pretrained=meta-llama/Llama-2-7b-chat-hf" --model hf --task=winogrande
```
Commit b281b09 - this is the version that HuggingFace uses for the Open LLM Leaderboard.
There is no difference between `winogrande.py` in b281b09 and `master`. Could you check whether the prompting is different in the Open LLM Leaderboard?
The results in the table are not from the leaderboard. I obtained them myself using the harness with the respective versions. Has the prompting changed internally since b281b09?
`preprocess_winogrande.py` did not exist in b281b09 - could this be the reason?
We've had a major refactor, so `preprocess_winogrande.py` does not exist in either b281b09 or `master`. But I'm getting the same results between `master` and `main`. Can you share the exact command you used? Can you also rerun on the `master` branch to make sure?
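For reference, on `main` the Winogrande task is scored with the partial-evaluation setup: each option is substituted into the sentence up to the blank, and the model scores the shared continuation after the blank under each prefix. Below is a minimal sketch of that idea, assuming the standard Winogrande dataset fields; it is an illustration, not the exact contents of `preprocess_winogrande.py`:

```python
# Sketch of the partial-evaluation setup used for Winogrande-style tasks.
# Illustrative only -- not the exact contents of preprocess_winogrande.py.

def doc_to_choice(doc):
    """Build the two candidate prefixes by filling the blank with each option."""
    idx = doc["sentence"].index("_")
    return [doc["sentence"][:idx] + opt for opt in (doc["option1"], doc["option2"])]

def doc_to_target(doc):
    """The continuation after the blank, scored under each candidate prefix."""
    idx = doc["sentence"].index("_") + 1
    return doc["sentence"][idx:].strip()

def gold_index(doc):
    """Map the dataset's '1'/'2' answer label to a 0-based choice index."""
    return {"1": 0, "2": 1}[doc["answer"]]

example = {
    "sentence": "The trophy doesn't fit in the suitcase because the _ is too small.",
    "option1": "trophy",
    "option2": "suitcase",
    "answer": "2",
}
print(doc_to_choice(example))  # two prefixes ending in 'trophy' / 'suitcase'
print(doc_to_target(example))  # "is too small."
print(gold_index(example))     # 1
```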
The Winogrande results on the Open LLM Leaderboard are 5-shot. Were your evals also 5-shot, @JeevanBhoot? If you're getting the same result as @lintangsutawika's zero-shot run, that could suggest something changed after b281b09 in the few-shot split implementation (maybe something specific to multiple-input/multiple-output tasks, or a difference in the sampling seed?). Might be related to #1179.
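To illustrate the sampling-seed hypothesis: if the in-context examples are drawn with a different sampler or seed in each version, every 5-shot prompt changes while 0-shot prompts stay identical, which would match the pattern above. A toy sketch with a hypothetical sampler (not the harness's actual implementation):

```python
import random

# Stand-in for the few-shot source split; purely illustrative.
train_docs = [f"example {i}" for i in range(100)]

def build_fewshot_context(doc_id: int, num_fewshot: int, seed: int) -> list[str]:
    """Hypothetical sampler: which examples land in the prompt depends on the seed."""
    rng = random.Random(seed + doc_id)
    return rng.sample(train_docs, num_fewshot)

# Two versions seeding the sampler differently build different 5-shot prompts
# for the same test doc, so their 5-shot scores need not agree...
print(build_fewshot_context(doc_id=0, num_fewshot=5, seed=1234))
print(build_fewshot_context(doc_id=0, num_fewshot=5, seed=42))

# ...while 0-shot prompts are identical regardless of the seed.
print(build_fewshot_context(doc_id=0, num_fewshot=0, seed=1234) ==
      build_fewshot_context(doc_id=0, num_fewshot=0, seed=42))  # True
```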
That makes sense. I only evaluated 0-shot. But knowing this helps pinpoint the issue as you said. Thanks. Will take another look.
I'm seeing the same discrepancy - I tried both 0-shot and 5-shot for Winogrande on meta-llama/llama-2-7b-chat-hf and get similar results (66.63, 66.46).
| Branch | Num Fewshot | Winogrande Accuracy |
|---|---|---|
| main | 5 | 66.38 |
| main | 1 | 66.38 |
| main | None | 66.38 |
| master | 5 | 73.01 |
| master | 1 | 68.90 |
| master | None | 66.38 |
The results on `main` are identical to the results on v0.4.0, and the results on `master` are identical to the results on b281b09.
Commands:
```
### main ###
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 5 --batch_size 1
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 1 --batch_size 1
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --batch_size 1

### master ###
python main.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 5 --batch_size 1
python main.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 1 --batch_size 1
python main.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --batch_size 1
```
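For anyone reproducing the `main` runs from Python instead of the CLI, here is a sketch assuming the v0.4.0 `simple_evaluate` entry point; argument names should be checked against the installed version:

```python
import lm_eval

# Rough Python equivalent of the 5-shot `main` CLI run above; argument names
# assume the v0.4.0 simple_evaluate API -- verify against your installed version.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-chat-hf",
    tasks=["winogrande"],
    num_fewshot=5,
    batch_size=1,
    device="cuda:0",
)
print(results["results"]["winogrande"])
```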
Made a fix in #1255
hf (pretrained=meta-llama/Llama-2-7b-chat-hf), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|----------|-------|------|-----:|------|-----:|---|-----:|
|winogrande|Yaml |none | 5|acc |0.7245|± |0.0126|