lm-evaluation-harness
Added CommonsenseQA task
Implements #1026
Results from running on llama2-7b:
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks commonsense_qa
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|--------------|-------|------|-----:|------|-----:|---|-----:|
|commonsense_qa|Yaml |none | 0|acc |0.5815|± |0.0141|
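For reference, roughly the same run can be done through the Python API instead of the CLI. The snippet below is a minimal sketch, assuming a v0.4-style harness where `lm_eval.simple_evaluate` is exposed; the metric key names ("acc,none", "acc_stderr,none") may differ across harness versions.

```python
# Minimal sketch: run the commonsense_qa task through the Python API.
# Assumes a v0.4-style harness exposing lm_eval.simple_evaluate;
# metric key names may vary by version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-chat-hf",
    tasks=["commonsense_qa"],
    num_fewshot=0,
)

metrics = results["results"]["commonsense_qa"]
print(metrics.get("acc,none"), metrics.get("acc_stderr,none"))
```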
@murphybrendan Thank you for the PR! Can you edit the OP to include a comparison between officially reported scores on some models and replications of those scores via our library?
Hello, have there been any updates on this? It would be very useful to have this task available in the library since it's been increasingly used in the literature. It looks like the functionality is working, so I was wondering if there's anything left to do other than further testing and some linting?
Just testing! If you can grab numbers from some papers that use it and compare the results from this library to what they report, that would help us move forward on merging it.
Sure, I can help run this on some open-source models and compare the results with their corresponding papers. Due to resource limitations, I'll be limited to models of around 7B parameters and below. I'll post the results here in a few days.
I've run the benchmark for some of the popular open-source models that report CommonsenseQA results in their papers. Here's a table comparing the value reported in each paper with the result obtained from this PR. All results are acc ± stderr and were run with a batch size of 4 on a V100 GPU; a scripted version of these runs is sketched after the table.
Model | n-shots | Paper Result | PR Result |
---|---|---|---|
gemma-2b | 7-shot | 65.3 [1] | 44.39 ± 1.42 |
gemma-7b | 7-shot | 71.3 [1] | 74.94 ± 1.24 |
Llama-2-7b | 7-shot | 57.6 [2] | 57.74 ± 1.41 |
Llama-3-8b | 7-shot | 72.6 [2] | 73.79 ± 1.26 |
[1] https://huggingface.co/google/gemma-2b#benchmark-results
[2] https://huggingface.co/meta-llama/Meta-Llama-3-8B#base-pretrained-models
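The runs can be scripted roughly as below. This is a minimal sketch, again assuming `lm_eval.simple_evaluate` from a v0.4-style harness, with the standard Hugging Face checkpoints as stand-ins for the exact models evaluated.

```python
# Hedged sketch of the 7-shot comparison runs (not the exact script used).
# Assumes lm_eval.simple_evaluate from a v0.4-style harness; model IDs are
# the standard Hugging Face checkpoints and may differ from those actually run.
import lm_eval

MODELS = [
    "google/gemma-2b",
    "google/gemma-7b",
    "meta-llama/Llama-2-7b-hf",
    "meta-llama/Meta-Llama-3-8B",
]

for model_id in MODELS:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=["commonsense_qa"],
        num_fewshot=7,
        batch_size=4,
    )
    metrics = out["results"]["commonsense_qa"]
    # Metric keys may vary by harness version ("acc,none" in recent releases).
    print(model_id, metrics.get("acc,none"), metrics.get("acc_stderr,none"))
```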
The Llama-2 and Llama-3 results look very good, but the Gemma models seem to be a bit off. I think this might boil down to the Hugging Face implementation of Gemma (I've had reproducibility issues with these models in the past). I've been able to consistently reproduce other Llama-2 and Llama-3 benchmarks using this library, so what I'm seeing here checks out.
@f4str I think this is enough to confirm and merge, seeing as it seems likely the prompt matches the Llama setting!
It's slightly unclear to me from the model card and tech report, but I suspect the Gemma discrepancy is because those numbers are Gemma-2B instruct numbers. (The gemma-2b-it model card reports the same table as the gemma-2b base model card.)
Thanks @murphybrendan @f4str for your work on this!