sacrebleu

Inconsistent scores between loop and separate check

Open 106AbdulBasit opened this issue 1 year ago • 0 comments

Description: I am encountering an issue with SacreBLEU where I am getting inconsistent scores between a loop implementation and a separate check for individual translations. Here are the details of the problem:

sacrebleu.sentence_bleu(sys, [refs])

Scenario: I am calculating BLEU scores for translations using both a loop and individual checks.

Expected Behavior: I anticipate consistent scores between the loop and the separate checks for the same translations.

Actual Behavior: The scores obtained from the loop implementation differ from the scores obtained from the separate check, even when using the same translation and reference pairs.

Example: Here is an example that demonstrates the discrepancy:

Translation: sys4 = "..." # Example translation
Reference: ref4 = ["..."] # Example reference
Expected Score (separate check): 100.0004
Actual Score (loop): 31.94

Steps to Reproduce:

1. Load the necessary data and libraries.
2. Implement the loop calculation using SacreBLEU, storing scores for each translation.
3. Perform a separate check for a specific translation and reference pair, using the same SacreBLEU calculation.
4. Compare the scores obtained from the loop and separate check.

Additional Information:

I have tried modifying the code, removing any potential sources of error, but the discrepancy persists. I have verified that the data inputs are aligned correctly, and the sentence preprocessing is consistent. I suspect there might be an issue related to how SacreBLEU is utilized in the loop implementation. Any guidance or insight into this issue would be greatly appreciated. Thank you!
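The comparison described in the steps above can be sketched as follows. This is only an illustration, not the code from the report: the sentences are made-up placeholders, `overlap_score` is a hypothetical stand-in scorer (not BLEU) used so the sketch runs without extra packages, and `find_mismatches` is an illustrative helper name. The idea is that a deterministic scorer should return the same value for the same hypothesis, references, and settings, so any index where the stored loop score differs from a fresh separate check suggests the loop scored different inputs (for example, shifted references).

```python
from typing import Callable, List, Sequence, Tuple

ScoreFn = Callable[[str, Sequence[str]], float]


def overlap_score(hypothesis: str, references: Sequence[str]) -> float:
    """Hypothetical stand-in scorer (NOT BLEU): percentage of hypothesis
    tokens that appear in the best-matching reference."""
    hyp_tokens = hypothesis.split()
    if not hyp_tokens:
        return 0.0
    best = 0.0
    for ref in references:
        ref_tokens = set(ref.split())
        matched = sum(1 for tok in hyp_tokens if tok in ref_tokens)
        best = max(best, 100.0 * matched / len(hyp_tokens))
    return best


def find_mismatches(
    hyps: Sequence[str],
    refs: Sequence[str],
    loop_scores: Sequence[float],
    score_fn: ScoreFn,
    tol: float = 1e-6,
) -> List[Tuple[int, float, float]]:
    """Re-score every pair separately and return (index, loop_score,
    separate_score) for each index where the two values disagree."""
    if not (len(hyps) == len(refs) == len(loop_scores)):
        raise ValueError("hyps, refs and loop_scores must have equal length")
    mismatches = []
    for i, stored in enumerate(loop_scores):
        separate = score_fn(hyps[i], [refs[i]])  # one hypothesis, its own reference
        if abs(separate - stored) > tol:
            mismatches.append((i, stored, separate))
    return mismatches


# Placeholder data (assumption: not the sentences from the report).
hyps = ["the cat sat", "a dog ran", "birds fly high"]
refs = ["the cat sat", "a dog ran", "birds fly high"]

# A loop that pairs each hypothesis with its own reference.
aligned_scores = [overlap_score(h, [r]) for h, r in zip(hyps, refs)]

# A loop with a deliberate misalignment: references shifted by one position.
shifted_refs = refs[1:] + refs[:1]
shifted_scores = [overlap_score(h, [r]) for h, r in zip(hyps, shifted_refs)]

aligned_report = find_mismatches(hyps, refs, aligned_scores, overlap_score)
shifted_report = find_mismatches(hyps, refs, shifted_scores, overlap_score)
```

To run the same check against SacreBLEU, `score_fn` could be replaced with something like `lambda hyp, refs: sacrebleu.sentence_bleu(hyp, refs).score`, assuming `sacrebleu` is installed and that the references passed in are a flat list of strings. The same tokenization and smoothing arguments would need to be used in both the loop and the separate check.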

106AbdulBasit, Jun 12 '23 10:06