lm-evaluation-harness
Add `--examples` Argument for Fine-Grained Task Evaluation in `lm-evaluation-harness`. This feature is the first step towards efficient multi-prompt evaluation with PromptEval [1,2]
This PR introduces a new `--examples` argument to the evaluation pipeline in `lm-evaluation-harness`, enabling users to evaluate specific examples across multiple tasks. It extends the functionality of the `--limit` argument by giving users control over exactly which examples are included in the evaluation. Users specify task examples via a JSON file containing a dictionary whose keys are task names and whose values are lists of example indices. For instance, a JSON file might look like this:
```json
{
  "mmlu_astronomy": [0, 3, 6],
  "mmlu_anatomy": [1, 4, 7, 10],
  "mmlu_econometrics": [2, 5, 8, 11, 14]
}
```
To use this feature, save the dictionary to a file (e.g., `/path/to/examples.json`) and run the following command:
```bash
lm_eval \
    --model hf \
    --model_args pretrained=Qwen/Qwen1.5-0.5B \
    --tasks mmlu_astronomy,mmlu_anatomy,mmlu_econometrics \
    --device cuda:0 \
    --log_samples \
    --output_path "/path/to/output" \
    --examples "/path/to/examples.json"
```
If no examples are specified for a task, all of its examples are evaluated.
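For convenience, here is a minimal sketch (the path simply mirrors the placeholder above) of how such a file could be written in Python before invoking the command:

```python
# Minimal sketch: write the per-task example indices shown above to a JSON file
# that can then be passed to lm_eval via --examples. The path is a placeholder.
import json

examples = {
    "mmlu_astronomy": [0, 3, 6],
    "mmlu_anatomy": [1, 4, 7, 10],
    "mmlu_econometrics": [2, 5, 8, 11, 14],
}

with open("/path/to/examples.json", "w") as f:
    json.dump(examples, f, indent=2)
```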
This new feature has multiple applications. It allows practitioners to evaluate models on specific subsets of interest, such as critical edge cases or benchmarks. It also supports multi-prompt evaluation using PromptEval [1,2] by enabling the evaluation of a few selected examples for each prompt template, followed by performance distribution estimation. As part of the future roadmap, we plan to integrate PromptEval functionality directly into lm-evaluation-harness to provide a seamless evaluation experience.
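For the PromptEval-style use case of evaluating only a few examples per task, one possible (purely illustrative) way to build the `--examples` file is to subsample indices at random, as in the sketch below; the task sizes here are assumed placeholders rather than values read from the harness:

```python
# Illustrative sketch: sample a handful of example indices per task to build an
# --examples file for a cheap evaluation run. The task sizes below are assumed
# placeholders; in practice they would come from the task's dataset split.
import json
import random

task_sizes = {
    "mmlu_astronomy": 152,
    "mmlu_anatomy": 135,
    "mmlu_econometrics": 114,
}

rng = random.Random(0)  # fixed seed so the subset is reproducible
subset = {
    task: sorted(rng.sample(range(n_docs), k=min(10, n_docs)))
    for task, n_docs in task_sizes.items()
}

with open("/path/to/examples_subset.json", "w") as f:
    json.dump(subset, f, indent=2)
```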
References
[1] Maia Polo, Felipe, Ronald Xu, Lucas Weber, Mírian Silva, Onkar Bhardwaj, Leshem Choshen, Allysson Flavio Melo de Oliveira, Yuekai Sun, and Mikhail Yurochkin. "Efficient multi-prompt evaluation of LLMs." arXiv preprint arXiv:2405.17202 (2024).
[2] https://github.com/felipemaiapolo/prompteval
Sorry for the delay! This looks good and we look forward to further integration of PromptEval functionality :)
Can you rerun the pre-commit formatter and then we can merge it?
Done @StellaAthena, thanks for the review!
Hi! Sorry for the delay. This slipped past me. Left a couple of comments, but the logic looks good!
Was thinking we could combine this with `limit`. It would make it more maintainable and allow for more backward compatibility. Thoughts? Something like: if `limit` is an int or float we keep the behaviour as before, but if it's a dict we use it as `examples`.
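A rough sketch of that dispatch could look like the following (names and signatures are purely illustrative, not the actual harness internals):

```python
# Rough sketch of the suggested dispatch: keep the old behaviour for int/float
# limits, and treat a dict as per-task example indices. Illustrative only.
from typing import Union


def resolve_limit(limit: Union[int, float, dict, None], task_name: str, n_docs: int):
    """Return the document indices to evaluate for one task."""
    if limit is None:
        return list(range(n_docs))                         # evaluate everything
    if isinstance(limit, dict):
        return limit.get(task_name, list(range(n_docs)))   # explicit indices, or all docs
    if isinstance(limit, float) and 0 < limit <= 1:
        return list(range(int(n_docs * limit)))            # fractional limit, as before
    return list(range(min(int(limit), n_docs)))            # integer limit, as before
```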
Hi @baberabb,
Thanks a lot for your comments! Regarding your suggestion of merging `limit` and `examples`, I would be fine with that. Do you mean changing just `__main__.py`, or all the evaluation functions as well?
@baberabb bumping this
Hi @felipemaiapolo. Thanks for bearing with us. Just fixed a couple of nits and added a bit more logging. Think it's ready to merge now!
What do you think about `indices` or `samples`? I'm a bit worried users will conflate this with few-shot examples.
Thank you for the improvements and comments @baberabb. `samples` instead of `examples` is a good option to avoid any confusion with the few-shot argument. I have already updated the code. cc: @felipemaiapolo
Hi! Thanks for the PR, and sorry it took so long. Bit short-staffed these days :S
This was actually a much-requested feature from many of our users! Really appreciate it.