[QUESTION] About Z-score
Hi, I see that COMET is trained on the z-scores of DA (Direct Assessment) annotations. However, I am not sure how this is implemented.
Is the rescaling done at the translation-direction level, or at some other level?
In essence, the z-score is computed per annotator, so that differences in how individual annotators use the scale do not create big discrepancies. It should therefore make scores more consistent across languages, but comparing scores between languages is still not very meaningful. However, since I used the same method as COMET in my estimator for EuroLLM (https://medium.com/p/7dccfe167814), and since WMT24 provides the same English source for most language pairs, you can see that scores do not differ much across the traditional pairs. The bigger question is: can we really trust the DA scores year after year?
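To make the per-annotator rescaling concrete, here is a minimal sketch in plain Python. It is not COMET's or WMT's actual preprocessing code. The function name `z_normalize_by_annotator`, the `(annotator_id, raw_score)` record layout, and the use of the population standard deviation are all assumptions made for illustration; the official data may use the sample standard deviation or additional filtering.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_normalize_by_annotator(records):
    """Return z-scores for (annotator_id, raw_score) records, in input order.

    Hypothetical helper: each raw score is standardized with the mean and
    (population) standard deviation of the annotator who produced it.
    """
    # Group raw scores by annotator.
    by_annotator = defaultdict(list)
    for annotator, score in records:
        by_annotator[annotator].append(score)

    # Per-annotator mean and standard deviation.
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}

    z_scores = []
    for annotator, score in records:
        mu, sigma = stats[annotator]
        # An annotator with constant scores carries no signal; map to 0.0.
        z_scores.append(0.0 if sigma == 0 else (score - mu) / sigma)
    return z_scores


# A lenient annotator ("a") and a strict one ("b") rating two segments each.
records = [("a", 90), ("a", 70), ("b", 40), ("b", 20)]
print(z_normalize_by_annotator(records))  # [1.0, -1.0, 1.0, -1.0]
```

After normalization the lenient and the strict annotator produce the same z-scores, which is the point of rescaling per annotator rather than per translation direction. It also shows why absolute comparisons across languages remain shaky: each z-score is relative to one annotator's own distribution.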