lm-evaluation-harness
add bypass metric
Started on #1152. Couple of issues:
- [x] Cannot amend `metric_list` in evaluate
- [x] print results crashing
- [x] repeated generations are not saved in `log_samples`
This is an excellent idea.
@haileyschoelkopf ready for review! I called the metric `bypass`. I don't know if you want to keep that or change it to `predict_only`.

Also had to get a bit hacky in changing the task attributes post-hoc if we are calling from `__main__`. Couldn't find an easier way.
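For reference, the intended usage looks roughly like this (a sketch; the entry point and flag name may still change per the naming question above, and the task name is only an example):

```bash
# Run with metric computation bypassed and dump the raw model outputs.
# --predict_only is the flag name proposed in this PR; `bypass` is the
# metric name used in place of the task's usual metrics.
python -m lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks gsm8k \
    --batch_size 8 \
    --limit 20 \
    --predict_only \
    --log_samples \
    --output_path ./predictions
```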
@lintangsutawika any updated thoughts on this, now that we are not planning to rework the metric / aggregation abstraction any time soon?
I'm quite a big fan of doing things in the way this PR implements it.
I'm not entirely on board, but I see this as needed for `--predict_only`, so this should be okay for now before we rework the metric / aggregation abstraction.
OK, have tested it out and it looks to be in working order. No major nits, except the loglikelihood table gets printed like this:
hf (pretrained=EleutherAI/pythia-70m,device=mps), gen_kwargs: (None), limit: 20.0, num_fewshot: None, batch_size: 8
|Tasks|Version|Filter|n-shot|Metric|Value| |Stderr|
|-----|-------|------|------|------|-----|---|------|
And if it's a mixed run, the loglikelihood tasks do not print at all. The samples are saved fine, though. IMO this will be used almost exclusively for generation tasks anyway.
If `predict_only` is exclusively for generation, do we still want to allow users to mix tasks of different output types while using the arg? It might be better not to, and to also skip the table (skip any metric calculation).
It works for both task types, but I can't think of when people would want to use it for loglikelihood. Maybe for debugging or experimenting? I can change it so that only generation scoring is bypassed when using the `--predict_only` flag.
Bypassing only generation tasks makes sense.
Also noticed we aren't passing `do_sample` when people change the `gen_kwargs`. Should we force that, or is it expected that people know they need to pass it as well? There's usually a warning from HF.
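For illustration, this is the kind of invocation in question (a sketch, assuming the comma-separated `key=value` syntax of `--gen_kwargs`; the task name is only an example). As things stand, the user has to remember to pass `do_sample` themselves:

```bash
# Without do_sample=True, transformers emits its usual warning and the
# sampling parameters are effectively ignored (greedy decoding is used).
python -m lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/pythia-70m \
    --tasks gsm8k \
    --gen_kwargs do_sample=True,temperature=0.8
```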
> Bypassing only generation tasks makes sense.
This seems nonintuitive for users. At the very least, a warning should be given when running non-generate tasks saying they'll still be scored even when `predict_only` is passed.
Also, we’ll need to document this CLI arg and feature in the docs!
Updated the warning for `gen_kwargs`. The previous one said "these settings will be used over set parameters in yaml tasks", but we were only updating the dict? Not sure which was the intended functionality. Tried to make the current one more obvious.
For `--predict_only`, a specific warning is generated for each task changed.
Added some drafts in the README! Also moved the table to the bottom; I thought this big (and ever-growing!) thing broke the flow of the document. Let me know if this is better or if I should revert it.
Sounds good! Shouldn't this be merged after #1167? Looks like most of the workarounds here won't be needed after that.
Also, thinking about it, I'm not really sold on the `predict_only` flag. Under what circumstances would someone want to use it (as opposed to the `bypass` metric in the config)? Seems like you can just run it normally and get the metric for free if you just want the samples.
> Shouldn't this be merged after https://github.com/EleutherAI/lm-evaluation-harness/pull/1167? Looks like most of the workarounds here won't be needed after that.
Sure, let's see what our timeline on merging #1167 is next week and go from there.
> Also, thinking about it, I'm not really sold on the `predict_only` flag. Under what circumstances would someone want to use it (as opposed to the `bypass` metric in the config)? Seems like you can just run it normally and get the metric for free if you just want the samples.
I think it's still valuable to have! For example, in the Llemma sympy-checked math tasks, for Maj@K at high K, doing the scoring actually takes way more wall-clock time than inference. Because it's CPU-bottlenecked and one wants to conserve GPU-hours / minimize time until logging results, it's valuable to do this heavy answer checking offline on a separate machine. However, allowing for it to give metrics at runtime is still valuable for QoL if someone doesn't care about this delay.
Also, the user who originally requested some version of this feature wanted it in order to run on tasks that are typically scored online, but wanted to run some postprocessing outside the library manually then score those edited results as normal.
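A sketch of that workflow (model and task names are only illustrative; the offline scoring step itself happens outside the harness):

```bash
# Generate on the GPU machine with scoring bypassed and dump the samples.
python -m lm_eval \
    --model hf \
    --model_args pretrained=EleutherAI/llemma_7b \
    --tasks minerva_math_algebra \
    --predict_only \
    --log_samples \
    --output_path ./llemma_preds
# The heavy sympy answer checking (or any manual post-processing) can then
# be run later against the dumped sample files on a CPU-only machine.
```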
> I think it's still valuable to have! For example, in the Llemma sympy-checked math tasks, for Maj@K at high K, doing the scoring actually takes way more wall-clock time than inference.
Aah yeah, that makes more sense! I was thinking more along the lines of just calculating `acc` or `exact_match`.