
Is it possible to use BLEU with multiple references?

Open juliafalcao opened this issue 1 year ago • 2 comments

I'm creating a new task and would like to evaluate my generated output against N different references with BLEU. However, the code appears to pick up only the first available reference, and I'm not sure how to map doc_to_target in the task YAML to include multiple references.

juliafalcao avatar Dec 14 '23 10:12 juliafalcao

I assume that if you want to use the BLEU metric, you would use the generate_until task type. In that case, you could also use Hugging Face's implementation of BLEU.

We support doc_to_target having more than one answer, so you could give the dataset a gold feature that stores a list of references for each sample.

lintangsutawika avatar Dec 14 '23 13:12 lintangsutawika

TriviaQA is one example of a dataset that uses multiple references! https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/triviaqa/default.yaml Please let us know if you have trouble mapping this onto BLEU.

haileyschoelkopf avatar Dec 14 '23 15:12 haileyschoelkopf
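
To illustrate what scoring against several references involves, here is a minimal pure-Python sketch of sentence-level multi-reference BLEU. It is not the harness's or Hugging Face's implementation: the names ngrams and multi_ref_bleu are hypothetical, tokenization is assumed to be plain whitespace splitting, and no smoothing is applied, so scores will differ from library implementations. It only shows the two places where multiple references matter: n-gram counts are clipped against the maximum count in any single reference, and the brevity penalty uses the reference length closest to the hypothesis length.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def multi_ref_bleu(hypothesis, references, max_n=4):
    """Sentence-level BLEU of one hypothesis against several references.

    Illustrative only: whitespace tokenization and no smoothing, so the
    score is 0.0 whenever some n-gram order has no match at all.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    if not hyp or not refs:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each hypothesis n-gram by its maximum count in any single reference.
        max_ref_counts = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        matched = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matched == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))

    # Brevity penalty: use the reference length closest to the hypothesis
    # length (ties go to the shorter reference).
    ref_len = min((len(r) for r in refs), key=lambda l: (abs(l - len(hyp)), l))
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    refs = ["the cat sat on the mat", "a cat is on the mat"]
    hyp = "a cat sat on the mat"
    print("both references:", multi_ref_bleu(hyp, refs))
    print("first reference only:", multi_ref_bleu(hyp, refs[:1]))
```

Because the hypothesis borrows "a cat" from the second reference and the rest from the first, it scores higher against both references together than against either one alone, which is the behavior lost when only the first reference is picked up.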