Fringe cases in MT eval metrics
There are several fringe cases that still bug the MT evaluation metrics in nltk.translate.
The BLEU-related issues are mostly resolved in #1330, but similar issues happen in RIBES and ChrF too:
- ribes_score.py
  - https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L290 and https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L320 are subject to ZeroDivisionError when the number of possible ngram pairs is 0 (see the sketch after this list)
- chrf_score.py
  - The references interface for the other scores supports multi-reference by default, while the ChrF score supports only a single reference. It should be standardized to accommodate multiple references.
  - But in the multi-reference case, there's no indication of which reference to choose in ChrF; we might need to contact the author to understand how to handle this.
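For instance, a minimal way to trigger the RIBES failure, assuming the develop version linked above, where kendall_tau and spearman_rho normalize by the number of possible pairs:

from nltk.translate.ribes_score import kendall_tau, spearman_rho

# A word-order vector with a single position yields choose(1, 2) == 0
# possible pairs, so the normalization divides by zero.
kendall_tau([0])   # ZeroDivisionError
spearman_rho([0])  # ZeroDivisionError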
In Maja Popovic's implementation, when multiple references are provided, the one that leads to the highest F-score is used; see:
https://github.com/m-popovic/chrF/blob/master/chrF%2B%2B.py#L155
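A minimal sketch of how NLTK could mirror that behaviour on top of the current single-reference interface (the multi_ref_chrf helper is hypothetical, not part of nltk.translate):

from nltk.translate.chrf_score import sentence_chrf

def multi_ref_chrf(references, hypothesis):
    # Hypothetical helper: score the hypothesis against each reference
    # separately and keep the best score, as Popovic's script does.
    return max(sentence_chrf(ref, hypothesis) for ref in references)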
There seems to be a bug in computing BLEU when one, but not all, of the sentences is shorter than the max n-gram length. For example, the following test case should get a BLEU score of 1.0 but does not:
from nltk.translate.bleu_score import corpus_bleu

references = [['John loves Mary'.split()], ['John still loves Mary'.split()]]
hypothesis = ['John loves Mary'.split(), 'John still loves Mary'.split()]
n = 4  # Maximum n-gram order.
weights = [1.0 / n] * n  # Uniform weights.
print(corpus_bleu(references, hypothesis, weights))
A sentence of length 3 that is identical to the reference is scored as having 0 out of 1 correct 4-grams, instead of 0 out of 0 correct 4-grams.
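The clamped denominator is visible directly in modified_precision (a quick check against the unpatched function):

from nltk.translate.bleu_score import modified_precision

# The 3-word hypothesis contains no 4-grams, yet the denominator is
# clamped to 1, so this pair contributes 0/1 to the corpus-level p_4.
p_4 = modified_precision(['John loves Mary'.split()], 'John loves Mary'.split(), 4)
print(p_4.numerator, p_4.denominator)  # 0 1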
Suggested patch:
--- a/nltk/translate/bleu_score.py
+++ b/nltk/translate/bleu_score.py
@@ -183,6 +183,8 @@ def corpus_bleu(
         # denominator for the corpus-level modified precision.
         for i, _ in enumerate(weights, start=1):
             p_i = modified_precision(references, hypothesis, i)
+            if p_i is None:
+                continue  # no ngrams because ref was shorter than i
             p_numerators[i] += p_i.numerator
             p_denominators[i] += p_i.denominator
@@ -240,6 +242,7 @@ def modified_precision(references, hypothesis, n):
     and denominator necessary to calculate the corpus-level precision.
     To calculate the modified precision for a single pair of hypothesis and
     references, cast the Fraction object into a float.
+    Returns None if references are shorter than n.
     The famous "the the the ... " example shows that you can get BLEU precision
     by duplicating high frequency words.
@@ -332,9 +335,10 @@ def modified_precision(references, hypothesis, n):
     }
     numerator = sum(clipped_counts.values())
-    # Ensures that denominator is minimum 1 to avoid ZeroDivisionError.
-    # Usually this happens when the ngram order is > len(reference).
-    denominator = max(1, sum(counts.values()))
+    denominator = sum(counts.values())
+    if denominator == 0:
+        # Avoid division by zero when the ngram order is > len(reference).
+        return None
     return Fraction(numerator, denominator, _normalize=False)
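With this patch, the empty 0/0 four-gram count from the 3-word pair is skipped instead of being clamped to 0/1, so the pooled p_4 becomes 1/1 and the test case above should score as expected (a sketch, assuming the patch is applied and the earlier example's variables are in scope):

# Re-running the earlier test case against the patched bleu_score.py:
print(corpus_bleu(references, hypothesis, weights))  # expected: 1.0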
@bmaland Previously, before @bamattsson's contribution in https://github.com/nltk/nltk/pull/1844, NLTK's BLEU did some tricks to make sure that exact string matches give a 1.0 result, but post-#1844, the BLEU scores in NLTK are similar to what can be found in https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
Do note that BLEU is meant to be a corpus-level metric rather than a sentence-level one. This is a good paper describing the related issues: https://arxiv.org/pdf/1804.08771.pdf
If you're unsure whether you have short strings in your list, feel free to use the auto_reweigh feature, e.g.
>>> from nltk.translate import bleu
>>> references = ['John loves Mary'.split(), 'John still loves Mary'.split()]
>>> hypothesis = 'John loves Mary'.split()
>>> bleu(references, hypothesis)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/translate/bleu_score.py:523: UserWarning:
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
warnings.warn(_msg)
1.2213386697554703e-77
>>> bleu(references, hypothesis, auto_reweigh=True)
1.0
In my example above, multi-bleu.perl gives 100.0, but NLTK gives 0.84. This is a case where the hypothesis does have some matching 4-grams, but not every sentence in the references is of length four or greater.
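For what it's worth, 0.84 is exactly what the clamping predicts: the pooled 4-gram precision is 1/2 (0/1 from the 3-word pair, 1/1 from the 4-word pair) while p_1, p_2 and p_3 are all 1, so the geometric mean with uniform 1/4 weights is:

# BLEU = (p_1 * p_2 * p_3 * p_4) ** (1/4) with the clamped p_4 = 0.5:
print((1.0 * 1.0 * 1.0 * 0.5) ** 0.25)  # 0.8408964152537145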
Is this still open?