Bug in BLEU calculation - BLEU of identical strings should be 1.0, not 0.0
Summary
I think I discovered a bug in the BLEU calculation that occurs when you compare two identical strings. The expected behavior is that the BLEU of completely identical strings is 1.0, but in some cases I get 0.0.
OS and Python versions
- OS: Windows 10 Pro
- Python: 3.10.13
- Evaluate: 0.4.1
Bug description
If you run:
import evaluate
bleu = evaluate.load("bleu")
bleu.compute(predictions=["Duplicate string"], references=[["Duplicate string"]])
you get:
{'bleu': 0.0,
'precisions': [1.0, 1.0, 0.0, 0.0],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 2,
'reference_length': 2}
as the output. I expected the 'bleu' value to be 1.0.
A similar thing happens if you run:
bleu.compute(predictions=["foobar"], references=[["foobar"]])
you get:
{'bleu': 0.0,
'precisions': [1.0, 0.0, 0.0, 0.0],
'brevity_penalty': 1.0,
'length_ratio': 1.0,
'translation_length': 1,
'reference_length': 1}
Again, I expected the BLEU score to be 1.0, not 0.0.
This will happen for any input sentence with fewer than 4 words under the default settings.
I checked and found that evaluate uses the TensorFlow implementation for the BLEU calculation. Based on the code below, if the sentence length is less than max_order, you will always get a BLEU score of 0 (irrespective of the reference and predicted sentences).
if min(precisions) > 0:
    p_log_sum = sum((1. / max_order) * math.log(p) for p in precisions)
    geo_mean = math.exp(p_log_sum)
else:
    geo_mean = 0

...

bleu = geo_mean * bp
I'm not an expert in machine translation, so I'm not sure if this is the expected behaviour. However, I would have guessed that if the precision for a particular n-gram order is 0, then that order should simply be excluded from the geometric mean (its coefficient set to 0), instead of the entire geo_mean being zeroed out.
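To make that concrete, here is a rough sketch of the reweighting I have in mind (not the library's code, and not a proposed patch), reusing the precisions and brevity penalty from the "foobar" example above:

import math

# Rough sketch of the idea above: drop n-gram orders with zero precision
# from the geometric mean instead of zeroing out the whole mean.
precisions = [1.0, 0.0, 0.0, 0.0]  # precisions from the "foobar" output
bp = 1.0                           # brevity penalty from the same output

nonzero = [p for p in precisions if p > 0]
if nonzero:
    p_log_sum = sum((1.0 / len(nonzero)) * math.log(p) for p in nonzero)
    geo_mean = math.exp(p_log_sum)
else:
    geo_mean = 0.0

bleu = geo_mean * bp
print(bleu)  # 1.0 with this reweighting, instead of 0.0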
Hoping an expert here can help us out with the correct resolution.
Hello @MislavJuric!
@Shyam-Thombre is correct that, by default, BLEU considers n-grams up to order 4 when computing the score. To obtain your preferred output, you can set the max_order parameter to the highest n-gram order you want BLEU to consider.
For example,
bleu.compute(predictions=["Duplicate string"], references=[["Duplicate string"]], max_order=2)
will give you:
{'bleu': 1.0, 'precisions': [1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.0, 'translation_length': 2, 'reference_length': 2}
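The same approach should work for your one-word "foobar" example (untested, but following the same logic), e.g.:

bleu.compute(predictions=["foobar"], references=[["foobar"]], max_order=1)

which should report a BLEU of 1.0, since the unigram precision and the brevity penalty are both already 1.0 in your original output.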