Fringe cases in MT eval metrics
There are several fringe cases that still bug the MT evaluation metrics in nltk.translate.
The BLEU-related issues are mostly resolved in #1330, but similar issues happen in RIBES and ChrF too:
- ribes_score.py: https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L290 and https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L320 are subject to ZeroDivisionError when the no. of possible ngram pairs is 0 (see the repro sketch after this list).
- chrf_score.py: The interface for the references in the other scores supports multiple references by default, while the ChrF score supports only a single reference. It should be standardized to accommodate multiple references.
- But in the case of a multi-reference score, there's no indication of which reference to choose in ChrF; we might need to contact the author to understand how to handle this.
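Here is a minimal repro sketch of the RIBES case (my example; it assumes the unguarded division described above). A one-token hypothesis gives C(1, 2) = 0 possible word-order pairs, so the correlation computation divides by zero:

from nltk.translate.ribes_score import sentence_ribes

# One-token pair: zero possible ngram pairs.
references = [['John']]
hypothesis = ['John']
try:
    print(sentence_ribes(references, hypothesis))
except ZeroDivisionError:
    print('ZeroDivisionError: 0 possible ngram pairs')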
In Maja Popovic's implementation, when multiple references are provided, the one that leads to the highest f-score is used; see:
https://github.com/m-popovic/chrF/blob/master/chrF%2B%2B.py#L155
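Until the interface is standardized, a user-side wrapper in the same spirit could look like this (a sketch, not part of the NLTK API; multi_ref_chrf is a hypothetical helper and assumes sentence_chrf's current single-reference signature):

from nltk.translate.chrf_score import sentence_chrf

def multi_ref_chrf(references, hypothesis):
    # Following the implementation above: score against each reference
    # separately and keep the highest-scoring one.
    return max(sentence_chrf(ref, hypothesis) for ref in references)

references = ['John loves Mary'.split(), 'John still loves Mary'.split()]
hypothesis = 'John loves Mary'.split()
print(multi_ref_chrf(references, hypothesis))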
There seems to be a bug in computing BLEU when one, but not all, of the sentences is shorter than the max n-gram length. For example, the following test case should get a BLEU of 1.0 but does not:
from nltk.translate.bleu_score import corpus_bleu

references = [['John loves Mary'.split()], ['John still loves Mary'.split()]]
hypothesis = ['John loves Mary'.split(), 'John still loves Mary'.split()]
n = 4  # Maximum n-gram order.
weights = [1.0 / n] * n  # Uniform weights.
print(corpus_bleu(references, hypothesis, weights))
A sentence of length 3 that is identical to the reference is scored as having 0 out of 1 correct 4-grams, instead of 0 out of 0 correct 4-grams.
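You can check this directly with the internal modified_precision helper (pre-patch behaviour; the Fraction it returns is deliberately unnormalized, so the raw counts are visible):

from nltk.translate.bleu_score import modified_precision

references = [['John', 'loves', 'Mary']]
hypothesis = ['John', 'loves', 'Mary']
p_4 = modified_precision(references, hypothesis, n=4)
print(p_4.numerator, p_4.denominator)  # prints "0 1": 0 of 1, not 0 of 0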
Suggested patch:
--- a/nltk/translate/bleu_score.py
+++ b/nltk/translate/bleu_score.py
@@ -183,6 +183,8 @@ def corpus_bleu(
         # denominator for the corpus-level modified precision.
         for i, _ in enumerate(weights, start=1):
             p_i = modified_precision(references, hypothesis, i)
+            if p_i is None:
+                continue  # No ngrams because the reference was shorter than i.
             p_numerators[i] += p_i.numerator
             p_denominators[i] += p_i.denominator
@@ -240,6 +242,7 @@ def modified_precision(references, hypothesis, n):
     and denominator necessary to calculate the corpus-level precision.
     To calculate the modified precision for a single pair of hypothesis and
     references, cast the Fraction object into a float.
+    Returns None if references are shorter than n.

     The famous "the the the ... " example shows that you can get BLEU precision
     by duplicating high frequency words.
@@ -332,9 +335,10 @@ def modified_precision(references, hypothesis, n):
     }

     numerator = sum(clipped_counts.values())
-    # Ensures that denominator is minimum 1 to avoid ZeroDivisionError.
-    # Usually this happens when the ngram order is > len(reference).
-    denominator = max(1, sum(counts.values()))
+    denominator = sum(counts.values())
+    if denominator == 0:
+        # Avoid division by zero when the ngram order is > len(reference).
+        return None

     return Fraction(numerator, denominator, _normalize=False)
@bmaland Previously, before @bamattsson's contribution in https://github.com/nltk/nltk/pull/1844, NLTK's BLEU did some tricks to make sure that exact string matches give a 1.0 result, but post #1844, the BLEU scores in NLTK are similar to those of https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl
Do note that BLEU is meant to be a corpus-level metric rather than a sentence-level one. This is a good paper describing the related issues: https://arxiv.org/pdf/1804.08771.pdf
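To make the corpus- vs. sentence-level distinction concrete, here's a small sketch: corpus_bleu pools the n-gram counts of all segments before computing precision, which is not the same as averaging per-sentence scores:

from nltk.translate.bleu_score import corpus_bleu, sentence_bleu

references = [['John loves Mary'.split()], ['John still loves Mary'.split()]]
hypotheses = ['John loves Mary'.split(), 'John still loves Mary'.split()]

pooled = corpus_bleu(references, hypotheses)
averaged = sum(
    sentence_bleu(refs, hyp) for refs, hyp in zip(references, hypotheses)
) / len(hypotheses)
print(pooled, averaged)  # pooled counts vs. mean per-sentence score; they differ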
If you're unsure whether you have short strings in your list, feel free to use the auto_reweigh feature, e.g.
>>> from nltk.translate import bleu
>>> references = ['John loves Mary'.split(), 'John still loves Mary'.split()]
>>> hypothesis = 'John loves Mary'.split()
>>> bleu(references, hypothesis)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/translate/bleu_score.py:523: UserWarning:
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
warnings.warn(_msg)
1.2213386697554703e-77
>>> bleu(references, hypothesis, auto_reweigh=True)
1.0
In my example above, multi-bleu.pl gives 100.0, but NLTK gives 0.84. This is a case where the hypothesis does have some matching 4-grams, but not every sentence in the reference is of length four or greater.
Is this still open?