
different score ranges are confusing


It's confusing that BLEU scores are 0-100 and ROUGE scores are 0-1 in this repo; I think all scores should be either 0-100 or 0-1, probably the former.

Muennighoff avatar Aug 16 '22 06:08 Muennighoff

I agree that standardizing on the $[0, 100]$ range is ideal for the readability of these scores. The difference here is that the underlying sacrebleu package scales BLEU/TER/chrF scores by $100$. These are the only metrics in the harness that are scaled this way (accuracy, ROUGE, SARI, etc. are not). So, to make everything consistent for now, we can re-scale BLEU back to its "natural" units in $[0, 1]$ and follow up with an optional per-metric "results formatter". What do you think?

jon-tow avatar Aug 18 '22 20:08 jon-tow
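
For illustration, a minimal sketch of the re-scaling being discussed, assuming sacrebleu's `corpus_bleu` API; the variable names here are hypothetical, not the harness's actual code:

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]

# sacrebleu reports BLEU on a 0-100 scale by default.
bleu = sacrebleu.corpus_bleu(hypotheses, references)

# Dividing by 100 brings BLEU back to its "natural" [0, 1] range,
# matching the unscaled metrics (accuracy, ROUGE, SARI, etc.).
rescaled_bleu = bleu.score / 100.0
print(f"sacrebleu scale: {bleu.score:.2f}, rescaled: {rescaled_bleu:.4f}")
```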

I think this suggestion makes a lot of sense. Additionally, it would be nice to have the option to get rounded scores, e.g., 17.7%.

StellaAthena avatar Sep 22 '22 20:09 StellaAthena
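
The optional per-metric results formatter mentioned above could also cover this rounding. A hypothetical sketch (the function name and signature are assumptions, not existing harness code):

```python
def format_score(value: float, as_percent: bool = True, digits: int = 1) -> str:
    """Render a score in [0, 1] as a rounded percentage, e.g. 0.177 -> '17.7%'."""
    if as_percent:
        return f"{value * 100:.{digits}f}%"
    return f"{value:.{digits + 2}f}"

print(format_score(0.177))         # "17.7%"
print(format_score(0.177, False))  # "0.177"
```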