lighteval
Winogrande degraded results
Hi,
I'm trying to reproduce the results from the Open LLM Leaderboard. All benchmarks match (within ~0.2%) except Winogrande, which is consistently lower when run through lighteval.
Examples
accelerate launch --num_processes=1 run_evals_accelerate.py --model_args="pretrained=mistralai/Mistral-7B-v0.1" --tasks='leaderboard|winogrande|5|0' --output_dir=lighteval_output --override_batch_size=1
LightEval Result: 75.61
OpenLB Result: 78.37
accelerate launch --num_processes=1 run_evals_accelerate.py --model_args="pretrained=google/gemma-7b" --tasks='leaderboard|winogrande|5|0' --output_dir=lighteval_output --override_batch_size=1
LightEval Result: 73.95
OpenLB Result: 79.01
The Open LLM Leaderboard results reference lighteval_sha '494ee12240e716e804ae9ea834f84a2c864c07ca'. Is that commit available somewhere?
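If that commit is reachable in the public lighteval repository, one way to pin an evaluation to it would be a sketch like the following (the repository URL is an assumption; only the SHA comes from the leaderboard metadata):

```shell
# Clone lighteval and check out the exact commit the leaderboard reports.
# Assumes the SHA below is present in the public repo's history.
git clone https://github.com/huggingface/lighteval.git
cd lighteval
git checkout 494ee12240e716e804ae9ea834f84a2c864c07ca
```

Running the same `accelerate launch` command from that checkout should rule out version drift as the cause of the Winogrande gap.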
Thanks