COMET
Multi-system Evaluation
🚀 Feature
It would be great if COMET supported multi-system evaluation with both Paired Bootstrap Resampling and Paired Approximate Randomization as paired significance tests.
Motivation
Paired Approximate Randomization is claimed to be more accurate than Paired Bootstrap Resampling when it comes to Type-I errors (Riezler and Maxwell III, 2005).
Alternatives
Besides Paired Bootstrap Resampling, please consider including Paired Approximate Randomization for paired significance tests.
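To make the request concrete, here is a minimal sketch of Paired Approximate Randomization over segment-level scores. It assumes each system's scores are available as plain lists aligned by segment, and that the system-level score is the mean of the segment scores. The function name and its parameters (`n_trials`, `seed`) are hypothetical and are not part of COMET or SacreBLEU.

```python
import random


def paired_approximate_randomization(scores_a, scores_b, n_trials=10000, seed=12345):
    """Return a two-sided p-value for the difference in mean segment scores.

    Hypothetical helper for illustration only; not a COMET or SacreBLEU API.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same segments.")
    if not scores_a:
        raise ValueError("At least one segment is required.")

    rng = random.Random(seed)
    n = len(scores_a)

    # Observed absolute difference between the two system-level means.
    observed = abs(sum(scores_a) - sum(scores_b)) / n

    at_least_as_extreme = 0
    for _ in range(n_trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap each pair of segment scores with probability 0.5.
            if rng.random() < 0.5:
                diff += b - a
            else:
                diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (at_least_as_extreme + 1) / (n_trials + 1)
```

A small p-value suggests that the observed difference between the two systems is unlikely to arise from randomly swapping their outputs.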
Additional context
When reporting Paired Bootstrap Resampling test results, could you please report the standard deviations along with the means?
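As an illustration of the kind of report I mean, here is a minimal sketch of Paired Bootstrap Resampling that returns the standard deviation alongside the mean for each system. It assumes segment-level scores are available as aligned lists and that the system score is their mean. The function name, its parameters, and the keys of the returned dictionary are hypothetical, not an existing COMET interface.

```python
import random
import statistics


def paired_bootstrap_summary(scores_a, scores_b, n_resamples=1000, seed=12345):
    """Summarise bootstrap system scores with means and standard deviations.

    Hypothetical helper for illustration only; not a COMET or SacreBLEU API.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("Both systems must be scored on the same segments.")
    if not scores_a or n_resamples < 2:
        raise ValueError("Need at least one segment and two resamples.")

    rng = random.Random(seed)
    n = len(scores_a)
    means_a, means_b = [], []
    wins_a = 0

    for _ in range(n_resamples):
        # Draw segment indices with replacement and reuse the SAME indices
        # for both systems, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        means_a.append(mean_a)
        means_b.append(mean_b)
        if mean_a > mean_b:
            wins_a += 1

    return {
        "mean_a": statistics.mean(means_a),
        "std_a": statistics.stdev(means_a),
        "mean_b": statistics.mean(means_b),
        "std_b": statistics.stdev(means_b),
        # Fraction of resamples in which system A scores higher than B.
        "win_rate_a": wins_a / n_resamples,
    }
```

Reporting `std_a` and `std_b` next to the means shows how much each system score varies across resamples, which a win rate alone does not convey.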
@andmek do you know of any Python implementation of this method that I can take a look at?
SacreBLEU (the current version, v2.0.0) has such implementations for the BLEU, chrF, and Translation Error Rate (TER) metrics. I am not sure whether they are implemented purely in Python.
sacrebleu is python only I believe. I'll take a look! thanks!