lm-evaluation-harness
Added CommonsenseQA task
Implements #1026
Results from running on llama2-7b:
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks commonsense_qa
| Tasks |Version|Filter|n-shot|Metric|Value | |Stderr|
|--------------|-------|------|-----:|------|-----:|---|-----:|
|commonsense_qa|Yaml |none | 0|acc |0.5815|± |0.0141|
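For reference, roughly the same run can be done through the Python API instead of the CLI. The snippet below is a minimal sketch, assuming a v0.4-style harness where `lm_eval.simple_evaluate` is exposed; the metric key names ("acc,none", "acc_stderr,none") may differ across harness versions.

```python
# Minimal sketch: run the commonsense_qa task through the Python API.
# Assumes a v0.4-style harness exposing lm_eval.simple_evaluate;
# metric key names may vary by version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-2-7b-chat-hf",
    tasks=["commonsense_qa"],
    num_fewshot=0,
)

metrics = results["results"]["commonsense_qa"]
print(metrics.get("acc,none"), metrics.get("acc_stderr,none"))
```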
@murphybrendan Thank you for the PR! Can you edit the OP to include a comparison between officially reported scores on some models and replications of those scores via our library?
Hello, have there been any updates on this? It would be very useful to have this task available in the library since it's been increasingly used in the literature. It looks like the functionality is working, so I was wondering if there's anything left to do other than further testing and some linting?
Just testing! If you can grab numbers from some papers that use it and compare the results from this library to what they report, that would help us move forward on merging it.
Sure, I can help run this on some open-source models and compare the results with their corresponding papers. Due to resource limitations, I'll be limited to models of around 7B parameters and below. I'll post the results here in a few days.
I've run the benchmark for some of the popular open-source models that report CommonsenseQA results in their papers. Here's a table comparing the value reported in each paper with the result obtained from this PR. All results are acc ± stderr and were run with a batch size of 4 on a V100 GPU; a scripted version of these runs is sketched after the table.
Model | n-shots | Paper Result | PR Result |
---|---|---|---|
gemma-2b | 7-shot | 65.3 [1] | 44.39 ± 1.42 |
gemma-7b | 7-shot | 71.3 [1] | 74.94 ± 1.24 |
Llama-2-7b | 7-shot | 57.6 [2] | 57.74 ± 1.41 |
Llama-3-8b | 7-shot | 72.6 [2] | 73.79 ± 1.26 |
[1] https://huggingface.co/google/gemma-2b#benchmark-results
[2] https://huggingface.co/meta-llama/Meta-Llama-3-8B#base-pretrained-models
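The runs can be scripted roughly as below. This is a minimal sketch, again assuming `lm_eval.simple_evaluate` from a v0.4-style harness, with the standard Hugging Face checkpoints as stand-ins for the exact models evaluated.

```python
# Hedged sketch of the 7-shot comparison runs (not the exact script used).
# Assumes lm_eval.simple_evaluate from a v0.4-style harness; model IDs are
# the standard Hugging Face checkpoints and may differ from those actually run.
import lm_eval

MODELS = [
    "google/gemma-2b",
    "google/gemma-7b",
    "meta-llama/Llama-2-7b-hf",
    "meta-llama/Meta-Llama-3-8B",
]

for model_id in MODELS:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=["commonsense_qa"],
        num_fewshot=7,
        batch_size=4,
    )
    metrics = out["results"]["commonsense_qa"]
    # Metric keys may vary by harness version ("acc,none" in recent releases).
    print(model_id, metrics.get("acc,none"), metrics.get("acc_stderr,none"))
```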
The Llama-2 and Llama-3 results look very good, but the Gemma models seem to be a bit off. I think this might boil down to the Hugging Face implementation of Gemma (I've had reproducibility issues with these models in the past). I've been able to consistently reproduce other Llama-2 and Llama-3 benchmarks using this library, so what I'm seeing here checks out.
@f4str I think this is enough to confirm and merge, seeing as it seems likely the prompt matches the Llama setting!
It's slightly unclear to me from the model card and tech report, but I suspect the Gemma discrepancy is because those numbers are Gemma-2B instruct numbers. (The gemma-2b-it model card reports the same table as the gemma-2b base model card.)
Thanks @murphybrendan @f4str for your work on this!