Bug in BLEU calculation - BLEU of identical strings should be 1.0, not 0.0
Summary
I think I discovered a bug in the BLEU calculation that occurs when you compare two identical strings. The expected behavior is that the BLEU of completely identical strings is 1.0, but in some cases I get 0.0.
OS and Python versions
- OS: Windows 10 Pro
- Python: 3.10.13
- Evaluate: 0.4.1
Bug description
If you run:
import evaluate
bleu = evaluate.load("bleu")
bleu.compute(predictions=["Duplicate string"], references=[["Duplicate string"]])
you get:
{'bleu': 0.0,
'precisions': [1.0, 1.0, 0.0, 0.0],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 2,
'reference_length': 2}
as the output. I expected the 'bleu' value to be 1.0.
A similar thing happens if you run:
bleu.compute(predictions=["foobar"], references=[["foobar"]])
you get:
{'bleu': 0.0,
'precisions': [1.0, 0.0, 0.0, 0.0],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 1,
'reference_length': 1}
Again, I expected the BLEU score to be 1.0, not 0.0.
This will happen for any input sentence with fewer than 4 words under the default settings.
I checked and found that evaluate uses the TensorFlow implementation for the BLEU calculation. Based on the code below, if the sentence length is less than max_order, you will always get a BLEU score of 0 (irrespective of the reference and predicted sentences).
if min(precisions) > 0:
    p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
    geo_mean = math.exp(p_log_sum)
else:
    geo_mean = 0

...

bleu = geo_mean * bp
I'm not an expert in machine translation, so I'm not sure if this is the expected behaviour. However, I would have guessed that if the precision for a particular n-gram order is 0, then that order should simply be excluded from the geometric mean (its coefficient set to 0), instead of the entire geo_mean being zeroed out.
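To make that concrete, here is a rough sketch of the reweighting I have in mind (not the library's code, and not a proposed patch), reusing the precisions and brevity penalty from the "foobar" example above:

import math

# Rough sketch of the idea above: drop n-gram orders with zero precision
# from the geometric mean instead of zeroing out the whole mean.
precisions = [1.0, 0.0, 0.0, 0.0]  # precisions from the "foobar" output
bp = 1.0                           # brevity penalty from the same output

nonzero = [p for p in precisions if p > 0]
if nonzero:
    p_log_sum = sum((1.0 / len(nonzero)) * math.log(p) for p in nonzero)
    geo_mean = math.exp(p_log_sum)
else:
    geo_mean = 0.0

bleu = geo_mean * bp
print(bleu)  # 1.0 with this reweighting, instead of 0.0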
Hoping an expert here can help us out with the correct resolution.
Hello @MislavJuric!
@Shyam-Thombre is correct that, by default, BLEU considers n-grams up to order 4 when computing the score. To obtain your preferred output, you can set the max_order parameter to the highest n-gram order you want BLEU to consider.
For example,
bleu.compute(predictions=["Duplicate string"], references=[["Duplicate string"]], max_order=2)
will give you:
{'bleu': 1.0, 'precisions': [1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 2, 'reference_length': 2}
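The same approach should work for your one-word "foobar" example (untested, but following the same logic), e.g.:

bleu.compute(predictions=["foobar"], references=[["foobar"]], max_order=1)

which should report a BLEU of 1.0, since the unigram precision and the brevity penalty are both already 1.0 in your original output.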