lm-evaluation-harness
How to interpret generated results for truthful_qa test
Hello, I ran the truthful_qa benchmark on a few models. However, the documentation does not explain how to interpret the results.
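For anyone trying to reproduce a table like this, a run can be launched through the harness's Python API roughly as in the sketch below (the model name is a placeholder and the exact `simple_evaluate` arguments may vary between harness versions):

```python
# Illustrative invocation only: the pretrained model name is a placeholder
# and argument names may differ slightly across lm-evaluation-harness versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                            # HuggingFace causal-LM backend
    model_args="pretrained=EleutherAI/gpt-j-6b",  # placeholder model name
    tasks=["truthfulqa_gen"],
    device="cuda:0",
)

# The per-metric values shown in the table below live under
# results["results"]["truthfulqa_gen"].
print(results["results"]["truthfulqa_gen"])
```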
Below is a sample result from one of my runs:
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---:|---|---:|
| truthfulqa_gen | 1 | bleurt_max | -0.5266 | ± | 0.0171 |
| | | bleurt_acc | 0.4590 | ± | 0.0174 |
| | | bleurt_diff | -0.0073 | ± | 0.0229 |
| | | bleu_max | 30.2628 | ± | 0.7998 |
| | | bleu_acc | 0.4382 | ± | 0.0174 |
| | | bleu_diff | 0.5205 | ± | 0.9126 |
| | | rouge1_max | 59.4502 | ± | 0.8515 |
| | | rouge1_acc | 0.4370 | ± | 0.0174 |
| | | rouge1_diff | 1.5690 | ± | 1.1883 |
| | | rouge2_max | 44.9280 | ± | 1.0439 |
| | | rouge2_acc | 0.4051 | ± | 0.0172 |
| | | rouge2_diff | 0.9125 | ± | 1.3574 |
| | | rougeL_max | 56.0939 | ± | 0.8891 |
| | | rougeL_acc | 0.4186 | ± | 0.0173 |
| | | rougeL_diff | 0.9057 | ± | 1.2061 |
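My rough understanding, from the TruthfulQA paper, is that each metric family is derived per question from the generated answer's similarity to the correct and incorrect reference answers, roughly as in the sketch below (`similarity` is a hypothetical stand-in for BLEU, ROUGE, or BLEURT, not the harness's actual code):

```python
# Sketch of how the *_max / *_acc / *_diff metrics appear to be derived per
# question; `similarity` is a hypothetical stand-in for BLEU, ROUGE, or BLEURT.
def truthfulqa_gen_metrics(answer, correct_refs, incorrect_refs, similarity):
    best_true = max(similarity(answer, ref) for ref in correct_refs)
    best_false = max(similarity(answer, ref) for ref in incorrect_refs)
    return {
        "max": best_true,                      # e.g. bleu_max
        "acc": float(best_true > best_false),  # e.g. bleu_acc
        "diff": best_true - best_false,        # e.g. bleu_diff
    }

# The reported table values would then be these per-question numbers
# averaged over the whole dataset.
```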
Which value should I report as the truthful_qa benchmark score on a Hugging Face repo? I'd be glad if anyone could help me out with this.