proxy-tuning
Question on TriviaQA evaluation metric
Hi @alisawuffles, thanks for your novel work! I have a question about the evaluation metric. When you evaluate models on the TriviaQA dataset, you compute accuracy as follows:
```python
test_df['output'] = [o.strip() for o in outputs]
cors = []
for i, row in test_df.iterrows():
    # ignore casing
    pred = row['output'].lower()
    answers = [a.strip().lower() for a in row['answers']]
    cors.append(pred in answers)
test_df['correct'] = cors
acc = np.nanmean(cors)
```
This code snippet judges a prediction correct only when the output exactly matches one of the reference answers (ignoring casing and surrounding whitespace), so any extra tokens in the output are scored as wrong. That seems more like an EM score than accuracy.
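For comparison, here is a minimal sketch of the containment-style check that open-ended QA evaluations sometimes use instead, where a prediction counts as correct if any gold answer appears inside it. The helper `contains_match` is hypothetical and not part of this repo; `test_df` and `outputs` are assumed to be the same objects as in the snippet above.

```python
import numpy as np

def contains_match(pred: str, answers: list[str]) -> bool:
    """Hypothetical lenient check: correct if any gold answer
    appears as a substring of the prediction (case-insensitive)."""
    pred = pred.strip().lower()
    return any(a.strip().lower() in pred for a in answers)

# Same scoring loop as above, but with containment instead of
# exact match (test_df / outputs assumed as in the original snippet).
test_df['output'] = [o.strip() for o in outputs]
cors = [contains_match(row['output'], row['answers'])
        for _, row in test_df.iterrows()]
test_df['correct'] = cors
acc = np.nanmean(cors)
```

The two criteria agree only when the model emits nothing but the answer string itself, which is why I wonder whether the reported number should be described as EM rather than accuracy.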