
Added CommonsenseQA task

murphybrendan opened this pull request 10 months ago · 3 comments

Implements #1026

murphybrendan avatar Apr 18 '24 13:04 murphybrendan


Results from running on Llama-2-7b-chat-hf:

```
lm_eval --model hf --model_args pretrained=meta-llama/Llama-2-7b-chat-hf --tasks commonsense_qa
```

|    Tasks     |Version|Filter|n-shot|Metric|Value |   |Stderr|
|--------------|-------|------|-----:|------|-----:|---|-----:|
|commonsense_qa|Yaml   |none  |     0|acc   |0.5815|±  |0.0141|
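
As a sanity check on the reported stderr: for a mean accuracy over n examples, it should match the usual binomial standard error. A quick check in Python, assuming the score is over CommonsenseQA's 1,221-example validation split (the split isn't stated above):

```python
# Sanity check (assumption: the score is over CommonsenseQA's 1,221-example
# validation split, which the thread doesn't state explicitly).
import math

p, n = 0.5815, 1221
stderr = math.sqrt(p * (1 - p) / n)  # binomial standard error of the mean
print(f"{stderr:.4f}")  # -> 0.0141, matching the table above
```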

murphybrendan avatar Apr 18 '24 15:04 murphybrendan
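
For context, tasks in lm-evaluation-harness are defined via YAML configs. A minimal sketch of what a CommonsenseQA config could look like under the harness's multiple-choice schema; the field values here are assumptions, and the PR's actual file may differ:

```yaml
# Sketch of a harness-style multiple-choice task config (values are
# assumptions; see the PR diff for the actual file).
task: commonsense_qa
dataset_path: tau/commonsense_qa
output_type: multiple_choice
training_split: train
validation_split: validation
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_choice: "{{choices.text}}"
doc_to_target: "{{choices.label.index(answerKey)}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
```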

@murphybrendan Thank you for the PR! Can you edit the OP to include a comparison between officially reported scores on some models and replications of those scores via our library?

StellaAthena avatar Apr 18 '24 15:04 StellaAthena

Hello, have there been any updates on this? It would be very useful to have this task available in the library since it's being increasingly used in the literature. It looks like the functionality is working, so I was wondering if there's anything left other than further testing and some linting?

f4str avatar Jun 18 '24 18:06 f4str

Just testing! If you can grab numbers from some papers that use it and compare the results from this library to what they report, that would help us move forward on merging it.

StellaAthena avatar Jun 20 '24 19:06 StellaAthena

Sure, I can help run this on some open-source models and compare the results with their corresponding papers. Due to resource limitations, I'll be limited to models of around 7B parameters and below. I'll post the results here in a few days.

f4str avatar Jun 23 '24 21:06 f4str

I've run the benchmark for some popular open-source models that report CommonsenseQA results in their papers. Here's a table comparing the value reported in each paper with the result obtained from this PR. All results are acc ± stderr and were run with a batch size of 4 on a V100 GPU.

|   Model    |n-shots|Paper Result|  PR Result  |
|------------|-------|-----------:|------------:|
|gemma-2b    |7-shot |    65.3 [1]|44.39 ± 1.42 |
|gemma-7b    |7-shot |    71.3 [1]|74.94 ± 1.24 |
|Llama-2-7b  |7-shot |    57.6 [2]|57.74 ± 1.41 |
|Llama-3-8b  |7-shot |    72.6 [2]|73.79 ± 1.26 |

[1] https://huggingface.co/google/gemma-2b#benchmark-results
[2] https://huggingface.co/meta-llama/Meta-Llama-3-8B#base-pretrained-models
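
For reproducibility, a sketch of how the runs above were presumably invoked (the exact command isn't given in the thread; the model name and flags are assumed from the description):

```bash
# 7-shot CommonsenseQA replication sketch; model name, batch size, and
# device flags are assumptions based on the setup described above.
lm_eval --model hf \
    --model_args pretrained=google/gemma-2b \
    --tasks commonsense_qa \
    --num_fewshot 7 \
    --batch_size 4 \
    --device cuda:0
```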

The Llama-2 and Llama-3 results match closely, but the Gemma results are off (gemma-2b substantially so). I think this might come down to the HF implementation of Gemma; I've had reproducibility issues with these models in the past. I've been able to consistently reproduce other Llama-2 and Llama-3 benchmarks with this library, so the Llama numbers here check out.

f4str avatar Jun 24 '24 17:06 f4str

@f4str I think this is enough to confirm and merge, seeing as it seems likely the prompt matches the Llama setting!

It's slightly unclear to me from the model card and tech report, but I suspect the Gemma discrepancy is because those numbers are Gemma-2B instruct numbers. (The gemma-2b-it model card reports the same table as the gemma-2b base model card.)

haileyschoelkopf avatar Jun 25 '24 15:06 haileyschoelkopf

Thanks @murphybrendan @f4str for your work on this!

haileyschoelkopf avatar Jun 25 '24 15:06 haileyschoelkopf