lm-evaluation-harness
add bypass metric
Started on #1152. Couple of issues:
- [x] Cannot amend `metric_list` in evaluate
- [x] print results crashing
- [x] repeated generations are not saved in `log_samples`
This is an excellent idea.
@haileyschoelkopf ready for review! I called the metric `bypass`. I don't know if you want to keep that or change it to `predict_only`.

Also had to get a bit hacky in changing the task attributes post-hoc if we are calling from `__main__`. Couldn't find an easier way.
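For reference, the intended usage looks roughly like this (a sketch; the entry point and flag name may still change per the naming question above, and the task name is only an example):

```bash
# Run with metric computation bypassed and dump the raw model outputs.
# --predict_only is the flag name proposed in this PR; `bypass` is the
# metric name used in place of the task's usual metrics.
python -m lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks gsm8k \
    --batch_size 8 \
    --limit 20 \
    --predict_only \
    --log_samples \
    --output_path ./predictions
```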
@lintangsutawika any updated thoughts on this, now that we are not planning to rework the metric / aggregation abstraction any time soon?
I'm quite a big fan of doing things in the way this PR implements it.
I'm not entirely on board, but I see this as needed for `--predict_only`, so this should be okay for now before we rework the metric / aggregation abstraction.
OK, have tested it out and it looks to be in working order. No major nits, except the loglikelihood table gets printed like this:
hf (pretrained=EleutherAI/pythia-70m,device=mps), gen_kwargs: (None), limit: 20.0, num_fewshot: None, batch_size: 8
|Tasks|Version|Filter|n-shot|Metric|Value| |Stderr|
|-----|-------|------|------|------|-----|---|------|
And if it's a mixed run, the loglikelihood tasks do not print at all. The samples are saved fine, though. IMO this will be used almost exclusively for generation tasks anyway.
If `predict_only` is exclusively for generation, do we still want to allow users to mix tasks of different output types while using the arg? It might be better not to, and to also skip the table (skip any metric calculation).
It works for both task types, but I can't think of when people would want to use it for loglikelihood. Maybe for debugging or experimenting? I can change it so that only generation scoring is bypassed when using the `--predict_only` flag.
Bypassing only generation tasks makes sense.
Also noticed we aren't passing `do_sample` when people change the `gen_kwargs`. Should we force that, or is it expected that people know they need to pass it as well? There's usually a warning from HF.
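For illustration, this is the kind of invocation in question (a sketch, assuming the comma-separated `key=value` syntax of `--gen_kwargs`; the task name is only an example). As things stand, the user has to remember to pass `do_sample` themselves:

```bash
# Without do_sample=True, transformers emits its usual warning and the
# sampling parameters are effectively ignored (greedy decoding is used).
python -m lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks gsm8k \
    --gen_kwargs do_sample=True,temperature=0.8
```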
> Bypassing only generation tasks makes sense.
This seems nonintuitive for users. At the very least, a warning should be given when running non-generate tasks saying they'll still be scored even when `predict_only` is passed.
Also, we’ll need to document this CLI arg and feature in the docs!
Updated the warning for `gen_kwargs`. The previous one said "these settings will be used over set parameters in yaml tasks", but we were only updating the dict? Not sure which was the intended functionality. Tried to make the current one more obvious.
For `--predict_only`, a specific warning is generated for each task changed.
Added some drafts in the README! Also moved the table to the bottom; I thought this big (and ever-growing!) thing broke the flow of the document. Let me know if this is better or if I should revert it.
Sounds good! Shouldn't this be merged after #1167? Looks like most of the workarounds here won't be needed after that.
Also, thinking about it, I'm not really sold on the `predict_only` flag. Under what circumstances would someone want to use it (as opposed to the `bypass` metric in the config)? Seems like you can just run it normally and get the metric for free if you just want the samples.
> Shouldn't this be merged after https://github.com/EleutherAI/lm-evaluation-harness/pull/1167? Looks like most of the workarounds here won't be needed after that.
Sure, let's see what our timeline on merging #1167 is next week and go from there.
> Also, thinking about it, I'm not really sold on the `predict_only` flag. Under what circumstances would someone want to use it (as opposed to the `bypass` metric in the config)? Seems like you can just run it normally and get the metric for free if you just want the samples.
I think it's still valuable to have! For example, in the Llemma sympy-checked math tasks, for Maj@K at high K, doing the scoring actually takes way more wall-clock time than inference. Because it's CPU-bottlenecked and one wants to conserve GPU-hours / minimize time until logging results, it's valuable to do this heavy answer checking offline on a separate machine. However, allowing for it to give metrics at runtime is still valuable for QoL if someone doesn't care about this delay.
Also, the user who originally requested some version of this feature wanted it in order to run on tasks that are typically scored online, but wanted to run some postprocessing outside the library manually then score those edited results as normal.
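A sketch of that workflow (model and task names are only illustrative; the offline scoring step itself happens outside the harness):

```bash
# Generate on the GPU machine with scoring bypassed and dump the samples.
python -m lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/llemma_7b \
    --tasks minerva_math_algebra \
    --predict_only \
    --log_samples \
    --output_path ./llemma_preds
# The heavy sympy answer checking (or any manual post-processing) can then
# be run later against the dumped sample files on a CPU-only machine.
```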
> I think it's still valuable to have! For example, in the Llemma sympy-checked math tasks, for Maj@K at high K, doing the scoring actually takes way more wall-clock time than inference.
Aah yeah, that makes more sense! I was thinking more along the lines of just calculating `acc` or `exact_match`.