
Multi-system Evaluation

Open andmek opened this issue 2 years ago • 3 comments

🚀 Feature

It would be great if COMET supported multi-system evaluation with both Paired Bootstrap Resampling and Paired Approximate Randomization as paired significance tests.

Motivation

Paired Approximate Randomization is reported to be less prone to Type-I errors (false claims of a significant difference) than Paired Bootstrap Resampling (Riezler and Maxwell III, 2005).
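To make the request concrete, here is a minimal sketch of Paired Approximate Randomization over segment-level scores. It is not COMET or SacreBLEU code: the function name, its parameters, and the choice of the mean segment score as the system-level statistic are all assumptions made for illustration.

```python
import random


def paired_approximate_randomization(scores_a, scores_b, n_trials=10000, seed=12345):
    """Two-sided p-value for the difference in mean score between two systems.

    Hypothetical helper: scores_a[i] and scores_b[i] are segment-level scores
    of system A and system B for the same source segment i.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    n = len(scores_a)
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / n - sum(scores_b) / n)

    at_least_as_extreme = 0
    for _ in range(n_trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap each pair of scores with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(sum(shuffled_a) / n - sum(shuffled_b) / n)
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (at_least_as_extreme + 1) / (n_trials + 1)
```

A small p-value suggests the observed difference is unlikely if the two systems were interchangeable. The sketch relies on the system score being an average of segment scores; a corpus-level metric would need to be recomputed on each shuffled set instead.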

Alternatives

Besides Paired Bootstrap Resampling, please consider including Paired Approximate Randomization for paired significance tests.

Additional context

When reporting Paired Bootstrap Resampling test results, could you please report the standard deviations along with the means?
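As an illustration of the kind of report meant here, the following sketch runs Paired Bootstrap Resampling on segment-level scores and returns the mean and standard deviation of the resampled system scores. The function name, the parameters, and the output layout are hypothetical and not taken from COMET.

```python
import random
from statistics import mean, stdev


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=12345):
    """Resample segments with replacement and summarise both systems.

    Hypothetical helper: the system score of a resample is assumed to be
    the mean of its segment-level scores.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    n = len(scores_a)
    rng = random.Random(seed)

    means_a, means_b = [], []
    wins_a = 0
    for _ in range(n_resamples):
        # Draw the same segment indices for both systems (the "paired" part).
        idx = [rng.randrange(n) for _ in range(n)]
        m_a = sum(scores_a[i] for i in idx) / n
        m_b = sum(scores_b[i] for i in idx) / n
        means_a.append(m_a)
        means_b.append(m_b)
        if m_a > m_b:
            wins_a += 1

    return {
        "a": (mean(means_a), stdev(means_a)),  # (mean, standard deviation)
        "b": (mean(means_b), stdev(means_b)),
        "a_win_rate": wins_a / n_resamples,
    }
```

The standard deviation across resamples gives a rough sense of how stable each system score is, which the mean alone does not show.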

andmek • Mar 19 '22 12:03

@andmek do you know of any Python implementation of this method that I can take a look at?

ricardorei • Mar 25 '22 17:03

SacreBLEU (the current version, v2.0.0) has such implementations for the BLEU, chrF, and translation error rate (TER) metrics. I am not sure whether they are implemented purely in Python.

andmek • Mar 26 '22 07:03

I believe sacrebleu is Python-only. I'll take a look, thanks!

ricardorei • Mar 26 '22 16:03