lm-evaluation-harness
Add `--predict_only` mode (run without scoring outputs)
As asked for by @dwadden from AI2.
We should support a CLI flag, `--predict_only`, which causes model outputs to be produced and saved, but then exits, reporting `metric = N/A` for all metrics. This is a useful feature in general.
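Roughly, the intended behaviour is something like the sketch below. All helper names here are made-up stand-ins for illustration, not the harness's actual API:

```python
import argparse
from typing import List

# Illustrative stand-ins for a task's generation and scoring steps
# (these are NOT the harness's real functions).
def generate_responses(prompts: List[str]) -> List[str]:
    return [f"model output for: {p}" for p in prompts]

def exact_match(responses: List[str], targets: List[str]) -> float:
    return sum(r == t for r, t in zip(responses, targets)) / len(targets)

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--predict_only", action="store_true",
                        help="Save model outputs but do not compute metrics")
    args = parser.parse_args()

    prompts, targets = ["2+2=", "capital of France?"], ["4", "Paris"]

    responses = generate_responses(prompts)   # inference always runs
    print("saved responses:", responses)      # stands in for saving/logging outputs

    if args.predict_only:
        print({"exact_match": "N/A"})         # report N/A instead of a score
    else:
        print({"exact_match": exact_match(responses, targets)})

if __name__ == "__main__":
    main()
```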
This could also be extended, if we so desire, to code execution tasks: lm-evaluation-harness would support running the generation passes needed for a HumanEval(+) score, but tell the user to take those model outputs and score them offline, properly sandboxed, at their own risk.
cc @StellaAthena, would you be willing to allow us to support code generation benchmarks if we did not perform the code execution online, and just supported the ability to "dry-run" the model outputs?
I think this would be a good middle ground where we can still support code tasks (which are in high demand, and people would appreciate having them in LM-Eval-Harness). Currently, the standard practice in the community is to run the code without sandboxing (e.g. the BigCode Eval Harness has just one CLI flag that turns on code execution), which is quite different from the situation before that bridge was crossed. We could still do significantly better and help raise the standard by not allowing online execution of this code within our library, and by pointing users to the security practices they should follow.
Would a `metric: bypass` be a good abstraction, do you think?
Yes, I think some "don't run" metric like that would be a good choice for things like code tasks, where we'd want that task's metric to always be bypassed at runtime.
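For illustration, a pass-through metric along those lines could look like the following. This is only a hedged sketch of the idea, not the harness's actual `bypass` implementation:

```python
from typing import Any, List

def bypass(items: Any) -> Any:
    """Pass-through 'metric': hand back whatever it is given, without scoring."""
    return items

def bypass_agg(arr: List[Any]) -> str:
    """Aggregation for the bypass metric: emit a placeholder instead of a real score."""
    return "N/A"

# A task configured with such a metric would show a placeholder in the results
# table while its raw generations are still saved for offline, sandboxed scoring.
print({"bypass": bypass_agg([bypass(r) for r in ["resp_1", "resp_2"]])})
```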
> cc @StellaAthena, would you be willing to allow us to support code generation benchmarks if we did not perform the code execution online, and just supported the ability to "dry-run" the model outputs?
Yes, this is an excellent idea.
Fantastic, I'm going to make trackers for some major code benchmarks soon then.
@haileyschoelkopf @baberabb I'm not sold on the bypass metric. Instead we could use the `predict_only` flag to skip evaluations.
https://github.com/EleutherAI/lm-evaluation-harness/blob/42730d90b388931e336c3071f8b0bbe0fcb69493/lm_eval/evaluator.py#L366-L386
My understanding is that for this mode, we need to return `resps` in an alternative format to the output from `--log_samples`? In that case everything after this snippet (aggregation, constructing the dict for printing tasks) could be skipped over.
This is a good impetus to rework the metrics+aggregation and the `process_results` method anyway.
I think that would be much neater than passing dummy metric calls, which I was considering. But we still need something in the configs for tasks like HumanEval, where evaluation metrics are unsupported and we only allow output generation?
I think we can write a warning/check for when `--predict_only` is set but the config in use has a list of metrics, saying that they won't be used or printed. I imagine there will be cases outside of code evaluations where a user also just wants the outputs (batch inference and analysis, for example).
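A minimal sketch of such a check (the function name and arguments here are hypothetical, not existing harness code):

```python
import logging
from typing import List

logger = logging.getLogger(__name__)

def warn_unused_metrics(task_name: str, metric_names: List[str], predict_only: bool) -> None:
    """Warn that a task's configured metrics will be ignored under --predict_only."""
    if predict_only and metric_names:
        logger.warning(
            "Task %s defines metrics %s, but --predict_only is set; "
            "they will not be computed or printed.",
            task_name, metric_names,
        )

logging.basicConfig(level=logging.WARNING)
warn_unused_metrics("humaneval", ["pass@1"], predict_only=True)
```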
@lintangsutawika there are a few reasons I like this `bypass` abstraction:
- It allows us to run a code eval task and a task that does do scoring simultaneously, and just have the code task skip scoring while the others score correctly.
- I like having a dummy results table printed, because it makes clear to the user that their run was successful and all the tasks ran inference correctly, just without printing true scores (see the sketch below).
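For instance, a mixed run could then produce something along these lines (task names and values are made up for illustration only):

```python
# Illustrative mixed-run output under the bypass idea: the code task reports a
# placeholder while the normally scored task reports real (here fabricated) numbers.
results = {
    "humaneval":      {"bypass": "N/A"},   # generation only; score offline, sandboxed
    "lambada_openai": {"acc": 0.72},       # scored as usual
}
for task, metrics in results.items():
    print(f"{task:16s} {metrics}")
```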
> My understanding is that for this mode, we need to return `resps` in an alternative format to the output from `--log_samples`?
I don't think this is the case? We can just set up the offline eval scripts to receive the `log_samples` format.
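As a sketch, an offline scorer over a saved samples file might look like this. The file path and the field names (`target`, `filtered_resps`) are assumptions about the dump format, so adjust them to whatever your harness version actually writes:

```python
import json
from pathlib import Path

def score_saved_samples(path: Path) -> float:
    """Toy offline exact-match scorer over a JSON list of logged samples."""
    samples = json.loads(path.read_text())
    correct = sum(s["filtered_resps"][0] == s["target"] for s in samples)
    return correct / len(samples)

if __name__ == "__main__":
    # Hypothetical path produced by a --log_samples run.
    print(score_saved_samples(Path("output/samples_mytask.json")))
```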
We could think of a way that still prints the final table. But a metric that does the pass-through feels a tad hackish and contrary to the idea of improving metrics+aggregation.
Also, `log_samples` actually already saves the model `resps`, which might already satisfy what's needed for offline eval, but it still doesn't address bypassing.
With the aggregate metric, the hackiness does decrease quite a bit IMO. Plus it would allow for tasks without targets, rather than needing to put in a dummy column 😂.
Is it intended behaviour that, with the `--predict_only` argument, metrics are still calculated (which may take time for bootstrappable metrics, for example; that is how I found this, from printed logs) by this (I suppose) code https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/evaluator.py#L495-L517 (or am I missing some setup in my custom-made task from the `Task` class)?

> Only model outputs will be saved and metrics will not be evaluated.

I read this description of the argument.
> Is it intended behaviour that, with the `--predict_only` argument, metrics are still calculated (which may take time for bootstrappable metrics, for example; that is how I found this, from printed logs) by this (I suppose) code https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/evaluator.py#L495-L517 (or am I missing some setup in my custom-made task from the `Task` class)?
>
> > Only model outputs will be saved and metrics will not be evaluated.
>
> I read this description of the argument.
Yeah, it's a bit of a hack right now. We just send the responses through no-op dummy metric functions. We could add conditions to skip the bootstrap, I guess, but I'm not sure complicating the code is worth the time saved? The metric calculation is generally pretty quick compared to the task setup and the model responses.
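As a toy illustration of what skipping the bootstrap buys (this is not the harness's actual stderr code, just a sketch of the idea):

```python
import random
import statistics
from typing import Callable, List, Optional

def bootstrap_stderr(metric: Callable[[List[float]], float],
                     values: List[float], iters: int) -> Optional[float]:
    """Toy bootstrap standard error; iters == 0 means the resampling is skipped."""
    if iters == 0:
        return None
    estimates = [metric([random.choice(values) for _ in values]) for _ in range(iters)]
    return statistics.stdev(estimates)

values = [1.0, 0.0, 1.0, 1.0, 0.0]
print(bootstrap_stderr(statistics.mean, values, iters=0))      # None: skipped entirely
print(bootstrap_stderr(statistics.mean, values, iters=1000))   # ~0.2, costs 1000 resamples
```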
> conditions to skip the bootstrap
Seems a good idea to have `--bootstrap_iters` in `__main__.py`, as it is hardcoded in `simple_evaluate` now. Then, for the moment, it could be used along with `--predict_only`: `--bootstrap_iters=0` to save time.
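Something along these lines, i.e. exposing the value on the CLI so it can be set to 0 alongside `--predict_only`. The argument names and default here are only illustrative, not a claim about the current CLI:

```python
import argparse

# Hypothetical wiring: surface bootstrap_iters as a flag and forward it to the evaluator.
parser = argparse.ArgumentParser()
parser.add_argument("--predict_only", action="store_true")
parser.add_argument("--bootstrap_iters", type=int, default=100000,
                    help="bootstrap resamples for stderr estimates; 0 disables bootstrapping")
args = parser.parse_args(["--predict_only", "--bootstrap_iters", "0"])
print(args.predict_only, args.bootstrap_iters)  # True 0
```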
> generally pretty quick
I was debugging a task with a `--limit` of 10 on CPU, with a T5ForConditionalGeneration model of ~220M parameters, and metric calculation with `matthews_corrcoef` seemed pretty slow to me. Bigger models on GPUs, running for >24 hours across multiple benchmarks, may still benefit from not calculating metrics when that is explicitly requested, even if it only saves an hour.
If I may suggest, with the evaluator refactoring in https://github.com/EleutherAI/lm-evaluation-harness/pull/1441, could it be split into several functions/classes, getting requests from the model and calculating metrics separately (with the latter under a `not args.predict_only` condition)?
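A sketch of what that split could look like; the class and function names here are purely illustrative, not the proposed refactor itself:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Prediction:
    doc_id: int
    response: str
    target: str

def collect_predictions(docs: List[Dict[str, str]]) -> List[Prediction]:
    """Phase 1: run the (here: dummy) model and keep its raw outputs."""
    return [Prediction(i, f"model output {i}", d["target"]) for i, d in enumerate(docs)]

def compute_metrics(preds: List[Prediction]) -> Dict[str, float]:
    """Phase 2: score the stored outputs; only reached when predict_only is False."""
    return {"acc": sum(p.response == p.target for p in preds) / len(preds)}

def evaluate(docs: List[Dict[str, str]], predict_only: bool) -> Dict[str, object]:
    preds = collect_predictions(docs)
    results = "N/A" if predict_only else compute_metrics(preds)
    return {"samples": preds, "results": results}

print(evaluate([{"target": "yes"}, {"target": "no"}], predict_only=True))
```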
> Seems a good idea to have `--bootstrap_iters` in `__main__.py`, as it is hardcoded in `simple_evaluate` now. Then, for the moment, it could be used along with `--predict_only`: `--bootstrap_iters=0` to save time.
If `--predict_only` is working properly then it shouldn't use `matthews_corrcoef` or any other metric. The idea was to replace the metric with this:
https://github.com/EleutherAI/lm-evaluation-harness/blob/8680e9386de5c4ad745a88b8726707a15f10cc65/lm_eval/api/metrics.py#L19-L21
Was that not the case? I'll see if I can isolate the bug. I'll add a condition to skip bootstrapping just to be safe. Thanks!
> If I may suggest, with the evaluator refactoring in https://github.com/EleutherAI/lm-evaluation-harness/pull/1441, could it be split into several functions/classes, getting requests from the model and calculating metrics separately (with the latter under a `not args.predict_only` condition)?
I like this idea! But it probably belongs in another PR; #1441 is pretty bloated as it is.
@LSinev Fixed the bug in #1441. There was an indentation bug, and `predict_only` was only true for generation tasks.