lm-evaluation-harness
Winogrande Performance Discrepancy
Could this PR have a significant impact on the Winogrande metric? I am seeing a discrepancy between v0.4.0 and the version that HuggingFace uses for the Open LLM Leaderboard (commit b281b09).
For two Llama2 models (one quantized, one unquantized), the Winogrande metric on v0.4.0 is significantly lower than what the older version of the harness produced, as shown below:
| Model | Harness Version | Winogrande |
|---|---|---|
| meta-llama/llama-2-7b-chat-hf | v0.4.0 | **66.38** |
| meta-llama/llama-2-7b-chat-hf | b281b09 | 73.01 |
| TheBloke/Llama-2-7B-Chat-GPTQ | v0.4.0 | **65.43** |
| TheBloke/Llama-2-7B-Chat-GPTQ | b281b09 | 70.80 |
Originally posted by @JeevanBhoot in https://github.com/EleutherAI/lm-evaluation-harness/issues/627#issuecomment-1879286269
@JeevanBhoot Could you share which branch produced 73.01? I'm testing both `main` and `master`, and they both produce 0.6646.

Command for `main`:

```
lm-eval --model_args="pretrained=meta-llama/Llama-2-7b-chat-hf" --model hf --task=winogrande
```

Command for `master`:

```
python main.py --model_args="pretrained=meta-llama/Llama-2-7b-chat-hf" --model hf --task=winogrande
```
Commit b281b09 - this is the version that HuggingFace uses for the Open LLM Leaderboard.
There is no difference between `winogrande.py` in b281b09 and `master`. Could you check whether the prompting is different in the Open LLM Leaderboard?
The results in the table are not from the leaderboard. I obtained them myself using the harness with the respective versions. Has the prompting changed internally since b281b09?
`preprocess_winogrande.py` did not exist in b281b09 - could this be the reason?
We've had a major refactor, so `preprocess_winogrande.py` does not exist in either b281b09 or `master`. But I'm getting the same results between `master` and `main`. Can you share the exact command you used? Can you also rerun on the `master` branch to make sure?
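For reference, on `main` the Winogrande task is scored with the partial-evaluation setup: each option is substituted into the sentence up to the blank, and the model scores the shared continuation after the blank under each prefix. Below is a minimal sketch of that idea, assuming the standard Winogrande dataset fields; it is an illustration, not the exact contents of `preprocess_winogrande.py`:

```python
# Sketch of the partial-evaluation setup used for Winogrande-style tasks.
# Illustrative only -- not the exact contents of preprocess_winogrande.py.

def doc_to_choice(doc):
    """Build the two candidate prefixes by filling the blank with each option."""
    idx = doc["sentence"].index("_")
    return [doc["sentence"][:idx] + opt for opt in (doc["option1"], doc["option2"])]

def doc_to_target(doc):
    """The continuation after the blank, scored under each candidate prefix."""
    idx = doc["sentence"].index("_") + 1
    return doc["sentence"][idx:].strip()

def gold_index(doc):
    """Map the dataset's '1'/'2' answer label to a 0-based choice index."""
    return {"1": 0, "2": 1}[doc["answer"]]

example = {
    "sentence": "The trophy doesn't fit in the suitcase because the _ is too small.",
    "option1": "trophy",
    "option2": "suitcase",
    "answer": "2",
}
print(doc_to_choice(example))  # two prefixes ending in 'trophy' / 'suitcase'
print(doc_to_target(example))  # "is too small."
print(gold_index(example))     # 1
```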
The Winogrande results on the Open LLM Leaderboard are 5-shot. Were your evals also 5-shot, @JeevanBhoot? If you're getting the same result as @lintangsutawika's zero-shot run, that could suggest something changed after b281b09 in the few-shot split implementation (maybe something specific to multiple-input/multiple-output tasks, or a difference in the sampling seed?). Might be related to #1179.
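To illustrate the sampling-seed hypothesis: if the in-context examples are drawn with a different sampler or seed in each version, every 5-shot prompt changes while 0-shot prompts stay identical, which would match the pattern above. A toy sketch with a hypothetical sampler (not the harness's actual implementation):

```python
import random

# Stand-in for the few-shot source split; purely illustrative.
train_docs = [f"example {i}" for i in range(100)]

def build_fewshot_context(doc_id: int, num_fewshot: int, seed: int) -> list[str]:
    """Hypothetical sampler: which examples land in the prompt depends on the seed."""
    rng = random.Random(seed + doc_id)
    return rng.sample(train_docs, num_fewshot)

# Two versions seeding the sampler differently build different 5-shot prompts
# for the same test doc, so their 5-shot scores need not agree...
print(build_fewshot_context(doc_id=0, num_fewshot=5, seed=1234))
print(build_fewshot_context(doc_id=0, num_fewshot=5, seed=42))

# ...while 0-shot prompts are identical regardless of the seed.
print(build_fewshot_context(doc_id=0, num_fewshot=0, seed=1234) ==
      build_fewshot_context(doc_id=0, num_fewshot=0, seed=42))  # True
```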
That makes sense. I only evaluated 0-shot. But knowing this helps pinpoint the issue as you said. Thanks. Will take another look.
I'm seeing the same discrepancy - I tried both 0-shot and 5-shot for Winogrande on meta-llama/llama-2-7b-chat-hf and get similar results (66.63, 66.46).
| Branch | Num Fewshot | Winogrande Accuracy |
|---|---|---|
| main | 5 | 66.38 |
| main | 1 | 66.38 |
| main | None | 66.38 |
| master | 5 | 73.01 |
| master | 1 | 68.90 |
| master | None | 66.38 |
The results on `main` are identical to the results on v0.4.0, and the results on `master` are identical to the results on b281b09.
Commands:
```
### main ###
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 5 --batch_size 1
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 1 --batch_size 1
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --batch_size 1

### master ###
python main.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 5 --batch_size 1
python main.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --num_fewshot 1 --batch_size 1
python main.py --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks winogrande --device cuda:0 --batch_size 1
```
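For anyone reproducing the `main` runs from Python instead of the CLI, here is a sketch assuming the v0.4.0 `simple_evaluate` entry point; argument names should be checked against the installed version:

```python
import lm_eval

# Rough Python equivalent of the 5-shot `main` CLI run above; argument names
# assume the v0.4.0 simple_evaluate API -- verify against your installed version.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-chat-hf",
    tasks=["winogrande"],
    num_fewshot=5,
    batch_size=1,
    device="cuda:0",
)
print(results["results"]["winogrande"])
```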
Made a fix in #1255
hf (pretrained=meta-llama/Llama-2-7b-chat-hf), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|----------|-------|------|-----:|------|-----:|---|-----:|
|winogrande|Yaml |none | 5|acc |0.7245|± |0.0126|