
Fringe cases in MT eval metrics

alvations opened this issue 7 years ago • 6 comments

There are several fringe cases that still bug the MT evaluation metrics in nltk.translate.

The BLEU-related issues are mostly resolved in #1330, but similar issues happen in RIBES and ChrF too:

  • ribes_score.py

    • https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L290 and https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L320 are subject to ZeroDivisionError when the number of possible ngram pairs is 0 (see the sketch after this list)
  • chrf_score.py

    • The reference interface for the other scores supports multiple references by default, while the ChrF score supports only a single reference. It should be standardized to accommodate multiple references.
      • But in the multi-reference case there's no indication of which reference ChrF should choose; we might need to contact the author to understand how to handle this.
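
For the RIBES case, here is a minimal sketch (assuming the current develop-branch code) of an input that can hit the division by zero: only one hypothesis word aligns to the reference, so the number of possible ngram pairs becomes 0.

from nltk.translate.ribes_score import sentence_ribes

reference = 'the cat'.split()
hypothesis = 'cat'.split()
# Only one word aligns, so choose(1, 2) == 0 possible pairs;
# this may raise ZeroDivisionError inside kendall_tau.
sentence_ribes([reference], hypothesis)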

alvations commented on Jul 26 '17 02:07

In Maja Popovic's implementation, when multiple references are provided, the one that leads to the highest F-score is used; see:

https://github.com/m-popovic/chrF/blob/master/chrF%2B%2B.py#L155
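
In other words, the multi-reference handling is roughly the following (a sketch only; chrf_fscore is a stand-in for the single-reference ChrF computation, not an actual NLTK function):

def multi_ref_chrf(hypothesis, references, chrf_fscore):
    # With several references, score the hypothesis against each one
    # and keep the best F-score, as in Popovic's implementation.
    return max(chrf_fscore(hypothesis, reference) for reference in references)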

ales-t commented on Jul 27 '18 15:07

There seems to be a bug in computing BLEU when one, but not all, of the sentences is shorter than the maximum n-gram length. For example, the following test case should get a BLEU score of 1.0 but does not:

from nltk.translate.bleu_score import corpus_bleu

references = [['John loves Mary'.split()], ['John still loves Mary'.split()]]
hypothesis = ['John loves Mary'.split(), 'John still loves Mary'.split()]
n = 4  # Maximum n-gram order.
weights = [1.0 / n] * n  # Uniform weights.
print(corpus_bleu(references, hypothesis, weights))

A sentence of length 3 that is identical to the reference is scored as having 0 out of 1 correct 4-grams, instead of 0 out of 0 correct 4-grams.
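
The 0-out-of-1 count can be observed directly from modified_precision (a sketch against the current behaviour, where the denominator is clamped to a minimum of 1):

from nltk.translate.bleu_score import modified_precision

references = [['John loves Mary'.split()]][0]
hypothesis = 'John loves Mary'.split()
p_4 = modified_precision(references, hypothesis, 4)
# There are no 4-grams at all, yet this reports 0 matches out of 1.
print(p_4.numerator, p_4.denominator)  # expected: 0 1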

Suggested patch:


--- a/nltk/translate/bleu_score.py
+++ b/nltk/translate/bleu_score.py
@@ -183,6 +183,8 @@ def corpus_bleu(
         # denominator for the corpus-level modified precision.
         for i, _ in enumerate(weights, start=1):
             p_i = modified_precision(references, hypothesis, i)
+            if p_i is None:
+                continue  # no ngrams because reference was shorter than i
             p_numerators[i] += p_i.numerator
             p_denominators[i] += p_i.denominator
 
@@ -240,6 +242,7 @@ def modified_precision(references, hypothesis, n):
     and denominator necessary to calculate the corpus-level precision.
     To calculate the modified precision for a single pair of hypothesis and
     references, cast the Fraction object into a float.
+    Returns None if references are shorter than n.
 
     The famous "the the the ... " example shows that you can get BLEU precision
     by duplicating high frequency words.
@@ -332,9 +335,10 @@ def modified_precision(references, hypothesis, n):
     }
 
     numerator = sum(clipped_counts.values())
-    # Ensures that denominator is minimum 1 to avoid ZeroDivisionError.
-    # Usually this happens when the ngram order is > len(reference).
-    denominator = max(1, sum(counts.values()))
+    denominator = sum(counts.values())
+    if denominator == 0:
+        # avoid div by zero when the ngram order is > len(reference)
+        return None
 
     return Fraction(numerator, denominator, _normalize=False)
 

danielgildea commented on May 29 '19 15:05

@bmaland Previously, before @bamattsson's contribution in https://github.com/nltk/nltk/pull/1844, NLTK's BLEU did some tricks to make sure that exact string matches give a 1.0 result, but post #1844 the BLEU scores in NLTK are similar to those produced by https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

Do note that BLEU is meant to be a corpus-level metric rather than a sentence-level one. This is a good paper describing the related issues: https://arxiv.org/pdf/1804.08771.pdf

alvations commented on May 30 '19 07:05

If you're unsure whether you have short strings in your list, feel free to use the auto_reweigh option, e.g.

>>> from nltk.translate import bleu
>>> references = ['John loves Mary'.split(), 'John still loves Mary'.split()]
>>> hypothesis = 'John loves Mary'.split()
>>> bleu(references, hypothesis)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/translate/bleu_score.py:523: UserWarning: 
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)
1.2213386697554703e-77
>>> bleu(references, hypothesis, auto_reweigh=True)
1.0

alvations commented on May 30 '19 07:05

In my example above, multi-bleu.perl gives 100.0, but NLTK gives 0.84. This is a case where the hypothesis does have some matching 4-grams, but not every sentence in the reference is of length four or greater.

danielgildea commented on May 30 '19 10:05

Is this still open?

Higgs32584 commented on Dec 22 '23 21:12