Fringe cases in MT eval metrics
There are several fringe cases that still bug the MT evaluation metrics in nltk.translate.
The BLEU-related issues are mostly resolved in #1330, but similar issues happen in RIBES and ChrF too:
- ribes_score.py
  - https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L290 and https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L320 are subject to ZeroDivisionError when the number of possible ngram pairs is 0 (see the sketch after this list)
- chrf_score.py
  - The references interface for the other scores supports multi-reference by default, while the ChrF score supports only a single reference. It should be standardized to accommodate multiple references.
  - But in the multi-reference case, there's no indication of which reference to choose in ChrF; we might need to contact the author to understand how to handle this.
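For instance, a minimal way to trigger the RIBES failure, assuming the develop version linked above, where kendall_tau and spearman_rho normalize by the number of possible pairs:

from nltk.translate.ribes_score import kendall_tau, spearman_rho

# A word-order vector with a single position yields choose(1, 2) == 0
# possible pairs, so the normalization divides by zero.
kendall_tau([0])   # ZeroDivisionError
spearman_rho([0])  # ZeroDivisionError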
In Maja Popovic's implementation, when multiple references are provided, the one that leads to the highest F-score is used; see:
https://github.com/m-popovic/chrF/blob/master/chrF%2B%2B.py#L155
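A minimal sketch of how NLTK could mirror that behaviour on top of the current single-reference interface (the multi_ref_chrf helper is hypothetical, not part of nltk.translate):

from nltk.translate.chrf_score import sentence_chrf

def multi_ref_chrf(references, hypothesis):
    # Hypothetical helper: score the hypothesis against each reference
    # separately and keep the best score, as Popovic's script does.
    return max(sentence_chrf(ref, hypothesis) for ref in references)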
There seems to be a bug in computing BLEU when one, but not all, of the sentences is shorter than the max n-gram length. For example, the following test case should get a BLEU score of 1.0 but does not:
from nltk.translate.bleu_score import corpus_bleu

references = [['John loves Mary'.split()], ['John still loves Mary'.split()]]
hypothesis = ['John loves Mary'.split(), 'John still loves Mary'.split()]
n = 4  # Maximum n-gram order.
weights = [1.0 / n] * n  # Uniform weights.
print(corpus_bleu(references, hypothesis, weights))
A sentence of length 3 that is identical to the reference is scored as having 0 out of 1 correct 4-grams, instead of 0 out of 0 correct 4-grams.
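The clamped denominator is visible directly in modified_precision (a quick check against the unpatched function):

from nltk.translate.bleu_score import modified_precision

# The 3-word hypothesis contains no 4-grams, yet the denominator is
# clamped to 1, so this pair contributes 0/1 to the corpus-level p_4.
p_4 = modified_precision(['John loves Mary'.split()], 'John loves Mary'.split(), 4)
print(p_4.numerator, p_4.denominator)  # 0 1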
Suggested patch:
--- a/nltk/translate/bleu_score.py
+++ b/nltk/translate/bleu_score.py
@@ -183,6 +183,8 @@ def corpus_bleu(
         # denominator for the corpus-level modified precision.
         for i, _ in enumerate(weights, start=1):
             p_i = modified_precision(references, hypothesis, i)
+            if p_i is None:
+                continue  # no ngrams because ref was shorter than i
             p_numerators[i] += p_i.numerator
             p_denominators[i] += p_i.denominator
@@ -240,6 +242,7 @@ def modified_precision(references, hypothesis, n):
     and denominator necessary to calculate the corpus-level precision.
     To calculate the modified precision for a single pair of hypothesis and
     references, cast the Fraction object into a float.
+    Returns None if references are shorter than n.
     The famous "the the the ... " example shows that you can get BLEU precision
     by duplicating high frequency words.
@@ -332,9 +335,10 @@ def modified_precision(references, hypothesis, n):
     }
     numerator = sum(clipped_counts.values())
-    # Ensures that denominator is minimum 1 to avoid ZeroDivisionError.
-    # Usually this happens when the ngram order is > len(reference).
-    denominator = max(1, sum(counts.values()))
+    denominator = sum(counts.values())
+    if denominator == 0:
+        # Avoid division by zero when the ngram order is > len(reference).
+        return None
     return Fraction(numerator, denominator, _normalize=False)
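With this patch, the empty 0/0 four-gram count from the 3-word pair is skipped instead of being clamped to 0/1, so the pooled p_4 becomes 1/1 and the test case above should score as expected (a sketch, assuming the patch is applied and the earlier example's variables are in scope):

# Re-running the earlier test case against the patched bleu_score.py:
print(corpus_bleu(references, hypothesis, weights))  # expected: 1.0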
@bmaland Previously, before @bamattsson's contribution in https://github.com/nltk/nltk/pull/1844, NLTK's BLEU did some tricks to make sure that exact string matches give a 1.0 result, but post-#1844, the BLEU scores in NLTK are similar to what can be found in https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
Do note that BLEU is meant to be a corpus-level metric rather than a sentence-level one. This is a good paper describing the related issues: https://arxiv.org/pdf/1804.08771.pdf
If you're unsure whether you have short strings in your list, feel free to use the auto_reweigh feature, e.g.
>>> from nltk.translate import bleu
>>> references = ['John loves Mary'.split(), 'John still loves Mary'.split()]
>>> hypothesis = 'John loves Mary'.split()
>>> bleu(references, hypothesis)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/translate/bleu_score.py:523: UserWarning:
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
warnings.warn(_msg)
1.2213386697554703e-77
>>> bleu(references, hypothesis, auto_reweigh=True)
1.0
In my example above, multi-bleu.perl gives 100.0, but NLTK gives 0.84. This is a case where the hypothesis does have some matching 4-grams, but not every sentence in the references is of length four or greater.
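For what it's worth, 0.84 is exactly what the clamping predicts: the pooled 4-gram precision is 1/2 (0/1 from the 3-word pair, 1/1 from the 4-word pair) while p_1, p_2 and p_3 are all 1, so the geometric mean with uniform 1/4 weights is:

# BLEU = (p_1 * p_2 * p_3 * p_4) ** (1/4) with the clamped p_4 = 0.5:
print((1.0 * 1.0 * 1.0 * 0.5) ** 0.25)  # 0.8408964152537145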
Is this still open?