different p-values compared to other tools

Open dimitarsh1 opened this issue 5 years ago • 0 comments

Hi,

I typically run multeval for BLEU and TER, but so far I haven't assessed statistical significance. Now that I actually need it, I find it difficult (1) to grasp what exactly multeval computes (I checked issue #8, which clarifies it somewhat) and (2) to run it 'correctly'.

Regarding (1): according to Koehn's paper (https://www.aclweb.org/anthology/W04-3250), I would assume that you take different samples of the sys1 and sys2 outputs, score them w.r.t. the reference, and assess the differences. If in 95% of the samples the scores favour the same system, then the difference is statistically significant. Or am I getting it wrong? Furthermore, I compared multeval to mteval using the same number of samples and shuffles, and the results are completely different.

Regarding (2): maybe this all comes from me not running multeval correctly. I have one reference and the outputs of two MT systems. Since multeval doesn't like it when there is only one variant for system 1 and for the baseline, I use copies, e.g. for system 1 I pass sys1.test.out and sys1.test.out.copy (the two files are identical). Is this a good way to invoke multeval?
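
To make point (1) concrete, here is a minimal sketch of paired bootstrap resampling as I understand it from Koehn's paper. It is not multeval's or mteval's code: the function names `unigram_precision` and `paired_bootstrap` are hypothetical, and the toy clipped unigram precision only stands in for BLEU or TER so that the example is self-contained.

```python
import random
from typing import Callable, List


def unigram_precision(hyps: List[str], refs: List[str]) -> float:
    """Toy corpus-level metric (a stand-in for BLEU): clipped unigram precision."""
    matched = 0
    total = 0
    for hyp, ref in zip(hyps, refs):
        # Count how often each token occurs in the reference sentence.
        ref_counts = {}
        for tok in ref.split():
            ref_counts[tok] = ref_counts.get(tok, 0) + 1
        # Each reference token can be matched at most once (clipping).
        for tok in hyp.split():
            if ref_counts.get(tok, 0) > 0:
                matched += 1
                ref_counts[tok] -= 1
            total += 1
    return matched / total if total else 0.0


def paired_bootstrap(
    sys1: List[str],
    sys2: List[str],
    refs: List[str],
    metric: Callable[[List[str], List[str]], float] = unigram_precision,
    n_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Return the fraction of resampled test sets on which sys1 scores strictly higher than sys2."""
    assert len(sys1) == len(sys2) == len(refs)
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; the SAME indices are used
        # for both systems and the reference, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score1 = metric([sys1[i] for i in idx], sample_refs)
        score2 = metric([sys2[i] for i in idx], sample_refs)
        if score1 > score2:
            wins += 1
    return wins / n_samples
```

Under this reading, a returned fraction of at least 0.95 would mean sys1 is better than sys2 at the 95% level. If two tools implement different tests, or resample in different ways, I would not necessarily expect their p-values to match even with the same number of samples.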

Thanks. Cheers, Dimitar

dimitarsh1 · Jun 18 '19 11:06