The difference between your bleu and sacrebleu
What is the difference between your package's bleu implementation and sacrebleu's implementation? I get different results from the two. My data is Chinese, so for sacrebleu I passed its zh tokenizer.
I believe there are some differences between this implementation and sacrebleu's. Actually, testing with English shows the same problem, as the English reproduction further below demonstrates.
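For context, the Chinese run was set up roughly as sketched below. The sentence is a placeholder rather than my real test data, and I have left the scores out; the point is only to show where sacrebleu's zh tokenizer enters the picture.

import evaluate
from sacrebleu.metrics import BLEU

# placeholder Chinese sentence, not my actual test data
predictions = ["猫坐在垫子上"]
references = [["猫坐在垫子上"]]

# your package's bleu, left on its default settings
bleu = evaluate.load("bleu")
print(bleu.compute(predictions=predictions, references=references))

# sacrebleu with its built-in zh tokenizer
sacre_bleu = BLEU(tokenize="zh")
print(sacre_bleu.corpus_score(predictions, references))

With my real Chinese data the two calls above disagree, which is what prompted this question.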
evaluate
import evaluate
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"]
]
bleu = evaluate.load("bleu")
results = bleu.compute(predictions=predictions, references=references, smooth=False, max_order=4)
print(results)
got results:
{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6}
sacrebleu
from sacrebleu.metrics import BLEU
predictions = ["hello there general kenobi", "foo bar foobar"]
references = [
    ["hello there general kenobi", "hello there !"],
    ["foo bar foobar"]
]
bleu = BLEU(smooth_method="none", max_ngram_order=4, tokenize='13a')
results = bleu.corpus_score(predictions, references)
print(results)
got results:
BLEU = 100.00 100.0/100.0/100.0/100.0 (BP = 1.000 ratio = 1.000 hyp_len = 4 ref_len = 4)