lm-evaluation-harness
More Flexible Answer Extraction Code
In LM Evaluation Harness, we work to match the "original" / "default" methods used to evaluate datasets. This includes using whatever answer extraction / post-processing is done by the original code implementations if provided, even if such extraction is flawed and may miss correct-in-substance-but-not-form answers.
Where appropriate and requested, we may consider adding more flexible answer extraction code; many users have asked for this. A good middle ground might be to support both a strict and a loose filter pipeline or metric for various datasets. For GSM8k, for example, the loose variant would score based on the last number the model outputs, rather than accepting only `#### {number}` as correct while rejecting answers like `so, the answer is {number}`.
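As a rough illustration of the difference (not the harness's actual filter implementation, just a regex-based sketch), the two extraction styles for GSM8k could look like this:

```python
import re
from typing import Optional


def strict_extract(completion: str) -> Optional[str]:
    """Strict GSM8k-style extraction: only accept an answer marked '#### <number>'."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", completion)
    return m.group(1).replace(",", "") if m else None


def loose_extract(completion: str) -> Optional[str]:
    """Loose extraction: take the last number appearing anywhere in the output."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return matches[-1].replace(",", "") if matches else None
```

Under strict extraction, a completion like `"So, the answer is 42."` scores as unparseable, while loose extraction recovers `42`; both agree on a well-formatted `"#### 42"`.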
There's definitely a balance to be struck, though, between being too permissive and being flexible enough that benchmarks aren't just a test of whether a model performs the right formatting steps and incantations.
This issue is to track our addition of such flexibility / improvements, and solicit requests or feedback on this.
Is this PR related -- https://github.com/EleutherAI/lm-evaluation-harness/pull/943 ?
Yup! This and the triviaqa ones are good examples of what we’ll want to handle.
Ideally we can use multiple filter pipelines for this purpose.