
More Flexible Answer Extraction Code

Open haileyschoelkopf opened this issue 1 year ago • 2 comments

In LM Evaluation Harness, we work to match the "original" / "default" methods used to evaluate datasets. This includes using whatever answer extraction / post-processing is done by the original code implementations if provided, even if such extraction is flawed and may miss correct-in-substance-but-not-form answers.

Where appropriate and requested, we may consider adding more flexible answer extraction code. This has been requested by many users. I think a good middle ground might be to support both a strict and a loose filter pipeline or metric for various datasets. For GSM8k, for example, the loose variant would score based only on the last number in the model's output, as opposed to accepting only answers of the form `#### {number}` and rejecting phrasings such as `So, the answer is {number}`.
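To make the strict-vs-loose distinction concrete, here is a minimal sketch (the regexes and function names are illustrative assumptions, not the harness's actual implementation) of the two extraction behaviors described above:

```python
import re


def strict_extract(text):
    # Strict GSM8k-style extraction: only accept "#### <number>".
    m = re.search(r"#### (-?[0-9.,]+)", text)
    return m.group(1).replace(",", "") if m else None


def flexible_extract(text):
    # Loose extraction: take the last number appearing anywhere in the output.
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text)
    return matches[-1].replace(",", "") if matches else None


# A correct-in-substance-but-not-form answer:
answer = "Let's see... 3 * 14 = 42. So, the answer is 42"
print(strict_extract(answer))    # None — the strict filter misses it
print(flexible_extract(answer))  # '42'
```

The loose filter credits the model for the right number even when the prescribed `####` format is absent, which is exactly where the two scoring schemes diverge.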

There's definitely a balance to be struck, though, between being too permissive and being flexible enough that benchmarks aren't just a test of whether a model performs the right formatting steps and incantations.

This issue is to track our addition of such flexibility / improvements, and solicit requests or feedback on this.

haileyschoelkopf avatar Dec 18 '23 18:12 haileyschoelkopf

Is this PR related -- https://github.com/EleutherAI/lm-evaluation-harness/pull/943 ?

anjor avatar Dec 31 '23 23:12 anjor

Yup! This and the triviaqa ones are good examples of what we’ll want to handle.

Ideally we can use multiple filter pipelines for this purpose.
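One way to picture the multiple-pipeline idea: run every named filter pipeline over the same model response and report a score per pipeline. This is a hypothetical sketch (names like `regex_filter` and `run_pipelines` are made up for illustration), not the harness's actual filter API:

```python
import re


def regex_filter(pattern, group_select=0):
    # Build a filter step that extracts a regex match from a response;
    # group_select=-1 picks the last match (the "flexible" behavior).
    def apply(response):
        matches = re.findall(pattern, response)
        return matches[group_select] if matches else "[invalid]"
    return apply


def run_pipelines(response, pipelines):
    # Apply each named pipeline to the same response, so one generation
    # yields both a strict and a flexible score.
    return {name: extract(response) for name, extract in pipelines.items()}


pipelines = {
    "strict-match": regex_filter(r"#### (-?[0-9.,]+)"),
    "flexible-extract": regex_filter(r"-?\d[\d,]*", group_select=-1),
}

print(run_pipelines("So, the answer is 42", pipelines))
# {'strict-match': '[invalid]', 'flexible-extract': '42'}
```

Reporting both results side by side keeps comparability with the original strict evaluation while also surfacing how much performance the formatting requirement alone is hiding.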

haileyschoelkopf avatar Jan 01 '24 00:01 haileyschoelkopf