lm-evaluation-harness
Add `--examples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2]
This PR introduces a new `--examples` argument to the evaluation pipeline in `lm-evaluation-harness`, enabling users to evaluate specific examples across multiple tasks. It extends the functionality of the `--limit` argument by giving users control over exactly which examples are included in the evaluation. Users specify task examples via a JSON file containing a dictionary whose keys are task names and whose values are lists of example indices. For instance, a JSON file might look like this:
```json
{
  "mmlu_astronomy": [0, 3, 6],
  "mmlu_anatomy": [1, 4, 7, 10],
  "mmlu_econometrics": [2, 5, 8, 11, 14]
}
```
To use this feature, save the dictionary to a file (e.g., `/path/to/examples.json`) and run the following command:
```bash
lm_eval \
    --model hf \
    --model_args pretrained=Qwen/Qwen1.5-0.5B \
    --tasks mmlu_astronomy,mmlu_anatomy,mmlu_econometrics \
    --device cuda:0 \
    --log_samples \
    --output_path "/path/to/output" \
    --examples "/path/to/examples.json"
```
If no examples are specified for a task, all of its examples are evaluated.
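For convenience, here is a minimal sketch (the path simply mirrors the placeholder above) of how such a file could be written in Python before invoking the command:

```python
# Minimal sketch: write the per-task example indices shown above to a JSON file
# that can then be passed to lm_eval via --examples. The path is a placeholder.
import json

examples = {
    "mmlu_astronomy": [0, 3, 6],
    "mmlu_anatomy": [1, 4, 7, 10],
    "mmlu_econometrics": [2, 5, 8, 11, 14],
}

with open("/path/to/examples.json", "w") as f:
    json.dump(examples, f, indent=2)
```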
This new feature has multiple applications. It allows practitioners to evaluate models on specific subsets of interest, such as critical edge cases or benchmarks. It also supports multi-prompt evaluation using PromptEval [1,2] by enabling the evaluation of a few selected examples for each prompt template, followed by performance distribution estimation. As part of the future roadmap, we plan to integrate PromptEval functionality directly into lm-evaluation-harness to provide a seamless evaluation experience.
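For the PromptEval-style use case of evaluating only a few examples per task, one possible (purely illustrative) way to build the `--examples` file is to subsample indices at random, as in the sketch below; the task sizes here are assumed placeholders rather than values read from the harness:

```python
# Illustrative sketch: sample a handful of example indices per task to build an
# --examples file for a cheap evaluation run. The task sizes below are assumed
# placeholders; in practice they would come from the task's dataset split.
import json
import random

task_sizes = {
    "mmlu_astronomy": 152,
    "mmlu_anatomy": 135,
    "mmlu_econometrics": 114,
}

rng = random.Random(0)  # fixed seed so the subset is reproducible
subset = {
    task: sorted(rng.sample(range(n_docs), k=min(10, n_docs)))
    for task, n_docs in task_sizes.items()
}

with open("/path/to/examples_subset.json", "w") as f:
    json.dump(subset, f, indent=2)
```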
References
[1] Maia Polo, Felipe, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. "Efficient multi-prompt evaluation of LLMs." arXiv preprint arXiv:2405.17202 (2024).
[2] https://github.com/felipemaiapolo/prompteval
Sorry for the delay! This looks good and we look forward to further integration of PromptEval functionality :)
Can you rerun the pre-commit formatter and then we can merge it?
Done @StellaAthena, thanks for the review!
Hi! Sorry for the delay. This slipped past me. Left a couple of comments, but the logic looks good!
Was thinking we could combine this with `limit`. It would make it more maintainable and allow for more backward compatibility. Thoughts? Something like: if `limit` is an int or float we keep the behaviour as before, but if it's a dict we use it as `examples`.
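A rough sketch of that dispatch could look like the following (names and signatures are purely illustrative, not the actual harness internals):

```python
# Rough sketch of the suggested dispatch: keep the old behaviour for int/float
# limits, and treat a dict as per-task example indices. Illustrative only.
from typing import Union


def resolve_limit(limit: Union[int, float, dict, None], task_name: str, n_docs: int):
    """Return the document indices to evaluate for one task."""
    if limit is None:
        return list(range(n_docs))                         # evaluate everything
    if isinstance(limit, dict):
        return limit.get(task_name, list(range(n_docs)))   # explicit indices, or all docs
    if isinstance(limit, float) and 0 < limit <= 1:
        return list(range(int(n_docs * limit)))            # fractional limit, as before
    return list(range(min(int(limit), n_docs)))            # integer limit, as before
```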
Hi @baberabb,
Thanks a lot for your comments! Regarding your suggestion of merging `limit` and `examples`, I would be fine with that. Do you mean changing just `__main__.py`, or all the evaluation functions as well?
@baberabb bumping this
Hi @felipemaiapolo. Thanks for bearing with us. Just fixed a couple of nits and added a bit more logging. Think it's ready to merge now!
What do you think about `indices` or `samples`? I'm a bit worried users will conflate this with few-shot examples.
Thank you for the improvements and comments @baberabb. `samples` instead of `examples` is a good option to avoid any confusion with the few-shot argument. I have already updated the code. cc: @felipemaiapolo
Hi! Thanks for the PR, and sorry it took so long. Bit short-staffed these days :S
This was actually a much-requested feature from many of our users! Really appreciate it.