lm-evaluation-harness
Is it possible to use BLEU with multiple references?
I'm creating a new task and would like to evaluate my generated output against N different references with BLEU, but the code appears to only pick up the first available reference, and I'm not sure how to map the doc_to_target in the task YAML to include multiple references.
I'm assuming that if you want to use the BLEU metric, you would want to use the generate_until task type. In that case, you could also use HF's implementation of BLEU.
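As a quick illustration of the multi-reference format that HF's BLEU expects (a minimal sketch, assuming you have the `evaluate` package installed; the example strings are made up):

```python
# Minimal sketch: HF `evaluate` BLEU with multiple references per prediction.
# Assumes `pip install evaluate`; predictions/references here are toy data.
import evaluate

bleu = evaluate.load("bleu")

predictions = ["the cat sat on the mat"]
# One inner list per prediction -- each inner list holds all gold references
# for that sample.
references = [["the cat is on the mat", "there is a cat on the mat"]]

result = bleu.compute(predictions=predictions, references=references)
print(result)  # dict containing a "bleu" score among other fields
```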
For doc_to_target we support returning more than one answer, so you could set up the dataset so that a gold feature stores a list of references for each sample.
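As a rough sketch of what that could look like (field names like `source` and `references`, and the dataset path, are placeholders for whatever your dataset actually provides, not anything in the harness):

```yaml
# Hypothetical task config: a generate_until task whose target is a list of references.
task: my_multiref_task            # placeholder task name
dataset_path: path/to/my_dataset  # placeholder dataset
output_type: generate_until
test_split: test
doc_to_text: "{{source}}"         # assumed input text field
doc_to_target: "{{references}}"   # assumed field holding a *list* of gold strings
generation_kwargs:
  until:
    - "\n"
metric_list:
  - metric: bleu
    aggregation: bleu
    higher_is_better: true
```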
TriviaQA is one example of a dataset that uses multiple references! https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/triviaqa/default.yaml Please let us know if you have trouble mapping this onto BLEU.