lm-evaluation-harness
How to interpret generated results for truthful_qa test
Hello, I ran the truthful_qa benchmark on a few models. However, the documentation does not explain how to interpret the results.
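For anyone trying to reproduce a table like this, a run can be launched through the harness's Python API roughly as in the sketch below (the model name is a placeholder and the exact `simple_evaluate` arguments may vary between harness versions):

```python
# Illustrative invocation only: the pretrained model name is a placeholder
# and argument names may differ slightly across lm-evaluation-harness versions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                            # HuggingFace causal-LM backend
    model_args="pretrained=EleutherAI/gpt-j-6b",  # placeholder model name
    tasks=["truthfulqa_gen"],
    device="cuda:0",
)

# The per-metric values shown in the table below live under
# results["results"]["truthfulqa_gen"].
print(results["results"]["truthfulqa_gen"])
```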
Below is a sample result from one of my runs:
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---:|---|---:|
| truthfulqa_gen | 1 | bleurt_max | -0.5266 | ± | 0.0171 |
| | | bleurt_acc | 0.4590 | ± | 0.0174 |
| | | bleurt_diff | -0.0073 | ± | 0.0229 |
| | | bleu_max | 30.2628 | ± | 0.7998 |
| | | bleu_acc | 0.4382 | ± | 0.0174 |
| | | bleu_diff | 0.5205 | ± | 0.9126 |
| | | rouge1_max | 59.4502 | ± | 0.8515 |
| | | rouge1_acc | 0.4370 | ± | 0.0174 |
| | | rouge1_diff | 1.5690 | ± | 1.1883 |
| | | rouge2_max | 44.9280 | ± | 1.0439 |
| | | rouge2_acc | 0.4051 | ± | 0.0172 |
| | | rouge2_diff | 0.9125 | ± | 1.3574 |
| | | rougeL_max | 56.0939 | ± | 0.8891 |
| | | rougeL_acc | 0.4186 | ± | 0.0173 |
| | | rougeL_diff | 0.9057 | ± | 1.2061 |
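My rough understanding, from the TruthfulQA paper, is that each metric family is derived per question from the generated answer's similarity to the correct and incorrect reference answers, roughly as in the sketch below (`similarity` is a hypothetical stand-in for BLEU, ROUGE, or BLEURT, not the harness's actual code):

```python
# Sketch of how the *_max / *_acc / *_diff metrics appear to be derived per
# question; `similarity` is a hypothetical stand-in for BLEU, ROUGE, or BLEURT.
def truthfulqa_gen_metrics(answer, correct_refs, incorrect_refs, similarity):
    best_true = max(similarity(answer, ref) for ref in correct_refs)
    best_false = max(similarity(answer, ref) for ref in incorrect_refs)
    return {
        "max": best_true,                      # e.g. bleu_max
        "acc": float(best_true > best_false),  # e.g. bleu_acc
        "diff": best_true - best_false,        # e.g. bleu_diff
    }

# The reported table values would then be these per-question numbers
# averaged over the whole dataset.
```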
Which value should I report as the truthful_qa benchmark score on a Hugging Face repo? I'd be glad if anyone could help me out with this.