Winogrande Performance Discrepancy

Open lintangsutawika opened this issue 1 year ago • 5 comments

Could this PR have a significant impact on the Winogrande metric? I am seeing a discrepancy between v0.4.0 and the version that HuggingFace uses for the Open LLM Leaderboard (commit b281b09).

For two Llama 2 models (quantized and unquantized), the Winogrande score on v0.4.0 is significantly lower than what the older version of the harness produced, as shown below:

| Model                         | Harness Version | Winogrande |
|-------------------------------|-----------------|-----------:|
| meta-llama/llama-2-7b-chat-hf | v0.4.0          |      66.38 |
| meta-llama/llama-2-7b-chat-hf | b281b09         |      73.01 |
| TheBloke/Llama-2-7B-Chat-GPTQ | v0.4.0          |      65.43 |
| TheBloke/Llama-2-7B-Chat-GPTQ | b281b09         |      70.80 |

Originally posted by @JeevanBhoot in https://github.com/EleutherAI/lm-evaluation-harness/issues/627#issuecomment-1879286269

lintangsutawika avatar Jan 06 '24 04:01 lintangsutawika

@JeevanBhoot Could you share which branch produced 73.01? I tested both main and master and they both produce 0.6646.

Command for main

lm-eval --model_args="pretrained=meta-llama/Llama-2-7b-chat-hf" --model hf --tasks=winogrande

Command for master

python main.py --model_args="pretrained=meta-llama/Llama-2-7b-chat-hf" --model hf --tasks=winogrande

lintangsutawika avatar Jan 06 '24 11:01 lintangsutawika

Commit b281b09 - this is the version that HuggingFace uses for the Open LLM Leaderboard.

JeevanBhoot avatar Jan 06 '24 11:01 JeevanBhoot

There is no difference between winogrande.py in b281b09 and master. Could you check whether the prompting is different in the Open LLM Leaderboard?
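
For anyone who wants to verify, the comparison can be reproduced with git (assuming the task file lives at lm_eval/tasks/winogrande.py in both revisions):

git diff b281b09 master -- lm_eval/tasks/winogrande.py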

lintangsutawika avatar Jan 06 '24 11:01 lintangsutawika

The results in the table are not from the leaderboard. I obtained them myself using the harness with the respective versions. Has the prompting changed internally since b281b09?

preprocess_winogrande.py did not exist in b281b09 - could this be the reason?
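
For reference, my understanding is that the new file implements Winogrande's partial-evaluation scoring: the blank is filled with each option, and the model scores the shared continuation after it. A minimal sketch, assuming the standard HF winogrande schema (sentence, option1, option2, answer), not the harness's exact code:

```python
# Sketch of Winogrande "partial evaluation": fill the blank ("_") with
# each option to form two candidate contexts, then score the text after
# the blank as the continuation shared by both choices. Field names
# follow the HF `winogrande` dataset schema; this mirrors, rather than
# reproduces, the harness's preprocess_winogrande.py.
def doc_to_choice(doc):
    idx = doc["sentence"].index("_")
    return [doc["sentence"][:idx] + opt for opt in (doc["option1"], doc["option2"])]

def doc_to_target(doc):
    # Continuation shared by both choices: everything after the blank.
    idx = doc["sentence"].index("_") + 1
    return doc["sentence"][idx:].strip()
```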

JeevanBhoot avatar Jan 06 '24 11:01 JeevanBhoot

We've had a major refactor, so preprocess_winogrande.py does not exist in either b281b09 or master. But I'm getting the same results between master and main. Can you share the exact command you used? Can you also rerun on the master branch to make sure?
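
When you rerun, it may also help to dump the exact prompts so we can see whether the few-shot examples are actually being prepended. Assuming v0.4's --log_samples flag (which requires --output_path), something like:

lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --num_fewshot 5 --log_samples --output_path results/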

lintangsutawika avatar Jan 06 '24 11:01 lintangsutawika

The Winogrande results on the Open LLM Leaderboard are 5-shot. Were your evals also 5-shot, @JeevanBhoot? If you are getting the same result as @lintangsutawika's zero-shot run, that would suggest something changed in the few-shot split implementation after b281b09 (maybe specific to multiple-input/multiple-output tasks, or a difference in the sampling seed?). Might be related to #1179.
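
To illustrate why the seed or split matters, a minimal sketch (the function name and seed are illustrative, not the harness's actual API): few-shot examples are drawn from a held-out split with a seeded RNG, so any change to the split, the seed, or the sampler changes every prompt:

```python
import random

# Hypothetical illustration, not the harness's actual API: k few-shot
# examples are drawn from a held-out split with a fixed RNG seed. If the
# split, the seed, or the sampler changed between b281b09 and v0.4.0,
# every prompt changes, and so does the measured accuracy.
def sample_fewshot(docs, k, seed=1234):
    rng = random.Random(seed)
    return rng.sample(docs, k)
```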

baberabb avatar Jan 07 '24 01:01 baberabb

That makes sense. I only evaluated 0-shot, but as you said, knowing this helps pinpoint the issue. Thanks, will take another look.

lintangsutawika avatar Jan 07 '24 01:01 lintangsutawika

I'm seeing the same discrepancy: I tried both 0-shot and 5-shot for Winogrande on meta-llama/llama-2-7b-chat-hf and got similar results (66.63, 66.46).

chromecast56 avatar Jan 07 '24 05:01 chromecast56

| Branch | Num Fewshot | Winogrande Accuracy |
|--------|-------------|--------------------:|
| main   | 5           |               66.38 |
| main   | 1           |               66.38 |
| main   | None        |               66.38 |
| master | 5           |               73.01 |
| master | 1           |               68.90 |
| master | None        |               66.38 |

The results on main are identical to the results on v0.4.0, and the results on master are identical to the results on b281b09.

Commands:

### main ###
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 5 --batch_size 1

lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 1 --batch_size 1

lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --batch_size 1

### master ###
python main.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 5 --batch_size 1

python main.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 1 --batch_size 1

python main.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --batch_size 1

JeevanBhoot avatar Jan 07 '24 14:01 JeevanBhoot

Made a fix in #1255

hf (pretrained=meta-llama/Llama-2-7b-chat-hf), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|  Tasks   |Version|Filter|n-shot|Metric|Value |   |Stderr|
|----------|-------|------|-----:|------|-----:|---|-----:|
|winogrande|Yaml   |none  |     5|acc   |0.7245|±  |0.0126|

lintangsutawika avatar Jan 08 '24 04:01 lintangsutawika