
Compute BLEU score of a Pandas DataFrame with valid rows filtered

Open shivanraptor opened this issue 1 year ago • 0 comments

I have a Pandas DataFrame loaded from an Excel file. It contains text data for which I need to calculate the BLEU score row by row.

import evaluate
import pandas as pd
sacrebleu = evaluate.load("sacrebleu")

testset = pd.read_excel(xlsx_filename)
# boolean mask of rows where all three columns are non-null
valid_rows = testset['col1'].notna() & testset['col2'].notna() & testset['col3'].notna()

for i in range(len(testset)): # or... for i in range(len(testset.loc[valid_rows, 'col2']))
    score = sacrebleu.compute(predictions=[testset.loc[valid_rows, 'col1'][i], testset.loc[valid_rows, 'col2'][i]], references=[testset.loc[valid_rows, 'col3'][i]])

It raises KeyError: 139.

valid_rows and testset both have length 13700, while testset.loc[valid_rows, 'col2'] has length 12208.
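
The length mismatch hints at the likely cause: a boolean-filtered Series keeps its original index labels, so indexing it with a plain integer looks up the label, not the position. If row 139 was filtered out, label 139 no longer exists. A minimal sketch with made-up data (the values below are hypothetical stand-ins, not taken from the Excel file):

```python
import pandas as pd

# Hypothetical stand-in for the Excel data; row 1 has a missing value.
testset = pd.DataFrame({
    "col1": ["a cat", None, "a dog"],
    "col3": ["the cat", "the bird", "the dog"],
})
valid_rows = testset["col1"].notna() & testset["col3"].notna()
filtered = testset.loc[valid_rows, "col1"]

# The filtered Series keeps the original labels 0 and 2; label 1 is gone.
labels = list(filtered.index)

# Label-based lookup of a dropped row fails, much like KeyError: 139.
try:
    filtered[1]
    raised = False
except KeyError:
    raised = True

# Positional lookup works regardless of which labels survived.
second = filtered.iloc[1]
```

Under that reading, either `.iloc[i]` on the filtered Series or `reset_index(drop=True)` after filtering would avoid the error.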

I know that looping over a DataFrame with a for-loop is an anti-pattern, but how can I pass a Series to the sacrebleu.compute() function? It seems to accept only [string, string], string as input.

How can I solve this problem?
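
One possible approach, sketched under assumptions: drop the incomplete rows once, then score each remaining prediction/reference pair by iterating over the column values directly, so no integer label lookup is involved. The helper `score_rows`, its `score_fn` parameter, and the toy scorer are hypothetical names introduced for illustration; the toy scorer is a simple token-overlap ratio, not BLEU, and is only there so the sketch runs without downloading a metric.

```python
import pandas as pd

def score_rows(df, pred_col, ref_col, score_fn):
    """Apply score_fn(prediction, reference) to every row with no missing values."""
    valid = df.dropna(subset=[pred_col, ref_col])
    scores = [score_fn(p, r) for p, r in zip(valid[pred_col], valid[ref_col])]
    # Reuse the surviving index so scores stay aligned with the original rows.
    return pd.Series(scores, index=valid.index, dtype=float)

def toy_score(prediction, reference):
    """Hypothetical placeholder scorer: share of prediction tokens found in the reference."""
    pred_tokens = prediction.split()
    ref_tokens = set(reference.split())
    return sum(t in ref_tokens for t in pred_tokens) / len(pred_tokens)

# Made-up data; row 1 is incomplete and is skipped.
testset = pd.DataFrame({
    "col1": ["the cat sat", None, "a dog"],
    "col3": ["the cat sat", "the bird", "the dog"],
})
scores = score_rows(testset, "col1", "col3", toy_score)
```

With the real metric, `score_fn` could plausibly be `lambda p, r: sacrebleu.compute(predictions=[p], references=[[r]])["score"]`, assuming the `evaluate` sacrebleu metric takes a list of predictions and a list of reference lists. Alternatively, passing whole columns in a single call would yield one corpus-level score instead of per-row scores.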


This question is also at: https://stackoverflow.com/questions/76581453/compute-bleu-score-of-a-pandas-dataframe-with-valid-rows-filtered

shivanraptor, Jun 29 '23 13:06