
CoQA's implementation only predicts the last answer of each text

Open glerzing opened this issue 1 year ago • 1 comment

For CoQA, in coqa/utils.py, only the last answer of each text is predicted (i.e. the answer for the last turn_id, with all the previous questions and answers in the context window). The authors of CoQA, however, appear to consider all turn_ids: their sample prediction file contains an answer for every turn_id, and the official evaluation script averages the score over every turn_id.
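For concreteness, here is a minimal sketch (not the harness's actual code) of the difference between the two scoring schemes. It assumes each document follows the CoQA JSON layout with parallel "questions"/"answers" lists keyed by "turn_id"; `predict` and `f1` are stand-ins for the model call and the metric:

```python
from typing import Callable, Dict, List


def score_all_turns(
    doc: Dict,
    predict: Callable[[Dict, int], str],
    f1: Callable[[str, List[str]], float],
) -> float:
    """What the official CoQA evaluation expects: predict and score every
    turn_id of the story, then average over all turns."""
    per_turn = []
    for question, answer in zip(doc["questions"], doc["answers"]):
        turn_id = question["turn_id"]
        pred = predict(doc, turn_id)  # prompt = story + turns before turn_id
        per_turn.append(f1(pred, [answer["input_text"]]))
    return sum(per_turn) / len(per_turn)  # average over all turn_ids


def score_last_turn_only(
    doc: Dict,
    predict: Callable[[Dict, int], str],
    f1: Callable[[str, List[str]], float],
) -> float:
    """What the issue describes the harness doing: only the final turn_id,
    with all earlier Q/A pairs used as context."""
    last_q, last_a = doc["questions"][-1], doc["answers"][-1]
    pred = predict(doc, last_q["turn_id"])
    return f1(pred, [last_a["input_text"]])
```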

Here is an excerpt from the paper (https://arxiv.org/pdf/1808.07042.pdf):

(screenshot of the relevant excerpt from the CoQA paper)

I haven't checked how other popular LLM evaluation frameworks implement this, but I'm fairly sure that predicting only the answer to the last question is not what the authors of CoQA intended.

glerzing · Jan 01 '24 05:01

I think it probably makes sense to support both versions of this task, but we should make it clear that the one described in the OP is the official one.
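If both variants are supported, one possible way to get the official behaviour without rewriting the prompt construction (a sketch only, independent of the harness's exact task API; the function name `expand_turns` and the "questions"/"answers" field names are assumptions based on the CoQA JSON layout) is to expand each story into one document per turn, so the existing last-turn logic ends up scoring every turn_id:

```python
from typing import Dict, Iterable, Iterator


def expand_turns(docs: Iterable[Dict]) -> Iterator[Dict]:
    """Yield one copy of each CoQA story per turn, truncated so that turn i
    is the "last" turn of the i-th copy and earlier turns remain as history."""
    for doc in docs:
        for i in range(1, len(doc["questions"]) + 1):
            yield {
                **doc,
                "questions": doc["questions"][:i],
                "answers": doc["answers"][:i],
            }
```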

StellaAthena · Jan 01 '24 13:01