[Feature] add acc_norm evaluation
Describe the feature
lm-evaluation-harness supports acc_norm evaluation, which is used by the Hugging Face Open LLM Leaderboard:

- ARC: 25-shot, arc-challenge (acc_norm)
- HellaSwag: 10-shot, hellaswag (acc_norm)
acc_norm scores each candidate answer by its summed token log-likelihoods divided by the answer's length, then takes the argmax:

```python
acc_norm = 1.0 if np.argmax(results / completion_len) == gold else 0.0
```
In the ARC and HellaSwag datasets, the answer options differ in length, and summed log-likelihoods are systematically biased by that length (longer continuations accumulate more negative totals), so the score should be normalized by answer length to give an accurate prediction. A self-contained sketch of the computation is below.
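To make this concrete, here is a minimal sketch of the metric — not the opencompass or lm-evaluation-harness implementation; the helper name `acc_and_acc_norm` and the toy inputs are illustrative. It assumes `results` holds the per-choice log-likelihood sums and `completion_len` the character length of each choice, following the snippet above:

```python
import numpy as np

def acc_and_acc_norm(loglikelihoods, choices, gold):
    # Raw vs. length-normalized accuracy for a single question.
    # loglikelihoods: summed log-likelihood of each candidate answer
    # choices: candidate answer strings (used only for their length)
    # gold: index of the correct answer
    results = np.array(loglikelihoods, dtype=float)
    completion_len = np.array([float(len(c)) for c in choices])
    acc = 1.0 if np.argmax(results) == gold else 0.0
    acc_norm = 1.0 if np.argmax(results / completion_len) == gold else 0.0
    return acc, acc_norm

# Toy example: the correct answer is longer, so its raw log-likelihood
# sum is more negative and acc picks the wrong (shorter) option, while
# acc_norm recovers the right one after dividing by length.
choices = ["cat", "a large domestic feline"]
lls = [-4.0, -12.0]  # hypothetical per-choice log-likelihood sums
print(acc_and_acc_norm(lls, choices, gold=1))  # -> (0.0, 1.0)
```

One design note: normalizing by character length rather than token count keeps the metric independent of the tokenizer.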
Will you implement it?
- [ ] I would like to implement this feature and create a PR!