
How to interpret generated results for truthful_qa test


Hello, I ran the truthful_qa benchmark on a few models.

However, the documentation does not explain how to interpret the results.
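For context, the run that produced the table below was roughly along the lines of this sketch (the model identifier is a placeholder, and the exact API locations differ between harness versions, e.g. `make_table` moved modules in newer releases):

```python
# Rough sketch of the evaluation call (placeholder model id; exact API may
# differ between lm-evaluation-harness versions):
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                        # HuggingFace causal-LM backend
    model_args="pretrained=<your-model-id>",  # placeholder model identifier
    tasks=["truthfulqa_gen"],
    num_fewshot=0,
)

# Prints the Task / Version / Metric / Value / Stderr table shown below.
print(evaluator.make_table(results))
```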

Below is a sample result from one of my benchmark runs:

| Task           | Version | Metric      | Value   | Stderr   |
|----------------|---------|-------------|---------|----------|
| truthfulqa_gen | 1       | bleurt_max  | -0.5266 | ± 0.0171 |
|                |         | bleurt_acc  | 0.4590  | ± 0.0174 |
|                |         | bleurt_diff | -0.0073 | ± 0.0229 |
|                |         | bleu_max    | 30.2628 | ± 0.7998 |
|                |         | bleu_acc    | 0.4382  | ± 0.0174 |
|                |         | bleu_diff   | 0.5205  | ± 0.9126 |
|                |         | rouge1_max  | 59.4502 | ± 0.8515 |
|                |         | rouge1_acc  | 0.4370  | ± 0.0174 |
|                |         | rouge1_diff | 1.5690  | ± 1.1883 |
|                |         | rouge2_max  | 44.9280 | ± 1.0439 |
|                |         | rouge2_acc  | 0.4051  | ± 0.0172 |
|                |         | rouge2_diff | 0.9125  | ± 1.3574 |
|                |         | rougeL_max  | 56.0939 | ± 0.8891 |
|                |         | rougeL_acc  | 0.4186  | ± 0.0173 |
|                |         | rougeL_diff | 0.9057  | ± 1.2061 |

Which value should I use as my truthful_qa benchmark score on a Hugging Face repo? I would be glad if anyone could help me out with this.

Joetib · Nov 15, 2023