COMET Multi-system Evaluation

Multi-system Evaluation

Open andmek opened this issue 2 years ago • 3 comments

🚀 Feature

It would be great if COMET supported multi-system evaluation with both Paired Bootstrap Resampling and Paired Approximate Randomization as paired significance tests.

Motivation

Paired Approximate Randomization is reported to be less prone to Type-I errors than Paired Bootstrap Resampling (Riezler and Maxwell III, 2005).

Alternatives

Besides Paired Bootstrap Resampling, please consider including Paired Approximate Randomization for paired significance tests.
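
For reference, here is a minimal sketch of what a paired approximate randomization test over segment-level scores could look like. It is not COMET's or SacreBLEU's implementation: the function name, its parameters, and the choice of the mean score difference as the test statistic are all assumptions made for illustration.

```python
import random


def paired_approximate_randomization(scores_a, scores_b, n_trials=10000, seed=12345):
    """Hypothetical helper: two-sided p-value for the difference in mean score.

    scores_a and scores_b are segment-level scores of two systems on the
    same segments, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")

    rng = random.Random(seed)
    # Swapping the two systems' outputs for a segment flips the sign of
    # that segment's score difference, so we only need the differences.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))

    at_least_as_extreme = 0
    for _ in range(n_trials):
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(shuffled) >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (at_least_as_extreme + 1) / (n_trials + 1)
```

A small p-value suggests that the observed difference between the two systems is unlikely under the null hypothesis that their outputs are interchangeable.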

Additional context

When reporting Paired Bootstrap Resampling test results, could you please report the standard deviations along with the means?
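
To illustrate the kind of report I have in mind, here is a rough sketch of paired bootstrap resampling that returns the standard deviation of each system's resampled mean alongside the mean itself. The function name, the returned keys, and the default number of resamples are assumptions for this example and do not reflect COMET's actual output.

```python
import random
import statistics


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=12345):
    """Hypothetical helper: resample segments with replacement, in pairs."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")

    rng = random.Random(seed)
    n = len(scores_a)
    means_a, means_b = [], []
    a_wins = 0
    for _ in range(n_resamples):
        # The same resampled indices are used for both systems (paired test).
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        means_a.append(mean_a)
        means_b.append(mean_b)
        if mean_a > mean_b:
            a_wins += 1

    return {
        "mean_a": statistics.mean(means_a),
        "std_a": statistics.stdev(means_a),
        "mean_b": statistics.mean(means_b),
        "std_b": statistics.stdev(means_b),
        "a_win_ratio": a_wins / n_resamples,
    }
```

Reporting `std_a` and `std_b` next to the means shows how much each system's score varies across resamples, which the win ratio alone does not convey.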

Mar 19 '22 12:03 andmek

@andmek do you know of any Python implementation of this method that I can take a look at?

Mar 25 '22 17:03 ricardorei

SacreBLEU (the current version, v2.0.0) has such implementations for the BLEU, chrF, and Translation Error Rate (TER) metrics. I am not sure whether they are implemented purely in Python.

Mar 26 '22 07:03 andmek

sacrebleu is Python-only, I believe. I'll take a look! Thanks!

Mar 26 '22 16:03 ricardorei
