lm-evaluation-harness
More Flexible Answer Extraction Code
In LM Evaluation Harness, we work to match the "original" / "default" methods used to evaluate datasets. This includes using whatever answer extraction / post-processing is done by the original code implementations if provided, even if such extraction is flawed and may miss correct-in-substance-but-not-form answers.
Where appropriate and requested, we may consider adding more flexible answer extraction code; many users have asked for this. A good middle ground might be to support both a strict and a loose filter pipeline or metric for various datasets. For GSM8k, for example, the loose variant would score based on the last number the model outputs, rather than accepting only `#### {number}` as correct while rejecting answers like `so, the answer is {number}`.
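As a rough illustration of the difference (not the harness's actual filter implementation, just a regex-based sketch), the two extraction styles for GSM8k could look like this:

```python
import re
from typing import Optional


def strict_extract(completion: str) -> Optional[str]:
    """Strict GSM8k-style extraction: only accept an answer marked '#### <number>'."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", completion)
    return m.group(1).replace(",", "") if m else None


def loose_extract(completion: str) -> Optional[str]:
    """Loose extraction: take the last number appearing anywhere in the output."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return matches[-1].replace(",", "") if matches else None
```

Under strict extraction, a completion like `"So, the answer is 42."` scores as unparseable, while loose extraction recovers `42`; both agree on a well-formatted `"#### 42"`.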
There's definitely a balance to be struck, though, between being too permissive and being flexible enough that benchmarks aren't just a test of whether a model performs the right formatting steps and incantations.
This issue is to track our addition of such flexibility / improvements, and solicit requests or feedback on this.
Is this PR related -- https://github.com/EleutherAI/lm-evaluation-harness/pull/943 ?
Yup! This and the triviaqa ones are good examples of what we’ll want to handle.
Ideally we can use multiple filter pipelines for this purpose.