proxy-tuning
Question on TriviaQA evaluation metric
Hi @alisawuffles, thanks for your novel work! I have a question about the evaluation metric. When you evaluate models on the TriviaQA dataset, you compute accuracy as follows:
```python
test_df['output'] = [o.strip() for o in outputs]
cors = []
for i, row in test_df.iterrows():
    # ignore casing
    pred = row['output'].lower()
    answers = [a.strip().lower() for a in row['answers']]
    cors.append(pred in answers)
test_df['correct'] = cors
acc = np.nanmean(cors)
```
This code snippet judges a prediction correct only when the output exactly matches one of the reference answers (ignoring casing and surrounding whitespace), so any extra tokens in the output are scored as wrong. That seems more like an EM score than accuracy.
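For comparison, here is a minimal sketch of the containment-style check that open-ended QA evaluations sometimes use instead, where a prediction counts as correct if any gold answer appears inside it. The helper `contains_match` is hypothetical and not part of this repo; `test_df` and `outputs` are assumed to be the same objects as in the snippet above.

```python
import numpy as np

def contains_match(pred: str, answers: list[str]) -> bool:
    """Hypothetical lenient check: correct if any gold answer
    appears as a substring of the prediction (case-insensitive)."""
    pred = pred.strip().lower()
    return any(a.strip().lower() in pred for a in answers)

# Same scoring loop as above, but with containment instead of
# exact match (test_df / outputs assumed as in the original snippet).
test_df['output'] = [o.strip() for o in outputs]
cors = [contains_match(row['output'], row['answers'])
        for _, row in test_df.iterrows()]
test_df['correct'] = cors
acc = np.nanmean(cors)
```

The two criteria agree only when the model emits nothing but the answer string itself, which is why I wonder whether the reported number should be described as EM rather than accuracy.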