
Add `--predict_only` mode (run without scoring outputs)

haileyschoelkopf opened this issue on Dec 18, 2023 · 12 comments

As asked for by @dwadden from AI2.

We should support a CLI flag, --predict_only, which causes model outputs to be produced and saved, but skips scoring and reports metric = N/A for all metrics. This would be a useful feature in general.

This could also be extended, if we so desire, to code execution tasks: lm-evaluation-harness would support running the generation passes needed for a HumanEval(+) score, but tell the user to take those model outputs and score them offline, properly sandboxed, at their own risk.
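
For concreteness, here is a minimal sketch of what that offline, score-at-your-own-risk step could look like. It assumes generations have been saved one JSON object per line with hypothetical completion and test fields (not necessarily the harness's actual output schema), and it assumes the user wraps this in real sandboxing (container, no network, resource limits), as stressed above:

```python
import json
import subprocess
import sys
import tempfile


def run_candidate(completion: str, test_code: str, timeout: float = 10.0) -> bool:
    # Concatenate the model's completion with the benchmark's test code and
    # run it in a separate Python process with a timeout.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(completion + "\n" + test_code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], timeout=timeout, capture_output=True
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False


def score_samples(samples_path: str) -> float:
    # One JSON object per line; "completion" and "test" are assumed field
    # names for illustration, not the harness's actual log format.
    passed = total = 0
    with open(samples_path) as f:
        for line in f:
            sample = json.loads(line)
            passed += run_candidate(sample["completion"], sample["test"])
            total += 1
    return passed / total if total else 0.0
```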

haileyschoelkopf avatar Dec 18 '23 13:12 haileyschoelkopf

cc @StellaAthena: would you be willing to let us support code generation benchmarks if we did not perform the code execution online, and only supported the ability to "dry-run" the model outputs?

I think this would be a good middle ground where we can still support code tasks (which are in high demand, and which people would appreciate having in LM-Eval-Harness). The current standard practice in the community is to just run the code without sandboxing (e.g. the BigCode Eval Harness has just one CLI flag that turns on code execution), which is quite different from before that bridge had been crossed. We could still do significantly better and help raise the standard by not allowing online execution of this code within our library, and by pointing to the security practices users should take.

haileyschoelkopf avatar Dec 18 '23 13:12 haileyschoelkopf

Do you think a `metric: bypass` would be a good abstraction?

baberabb avatar Dec 18 '23 14:12 baberabb

Yes, I think some "don't run" metric like that would be a good choice for things like code tasks, where we'd want that task's metric to always be bypassed at runtime.
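
As a rough illustration only (not the library's actual API), such a metric could simply be a no-op that ignores its inputs and hands a sentinel to the results table:

```python
# Illustrative sketch of a "don't run" metric: it accepts whatever
# (prediction, reference) items the task produces and returns a placeholder
# instead of a real score. Names are hypothetical, not the harness's API.
def bypass(items) -> str:
    return "N/A"


def bypass_agg(values) -> str:
    # Aggregation is also a no-op, so the results table just shows N/A.
    return "N/A"
```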

haileyschoelkopf avatar Dec 18 '23 14:12 haileyschoelkopf

cc @StellaAthena: would you be willing to let us support code generation benchmarks if we did not perform the code execution online, and only supported the ability to "dry-run" the model outputs?

Yes, this is an excellent idea.

StellaAthena avatar Dec 18 '23 16:12 StellaAthena

Fantastic, I'm going to make trackers for some major code benchmarks soon then.

haileyschoelkopf avatar Dec 18 '23 16:12 haileyschoelkopf

@haileyschoelkopf @baberabb I'm not sold on the bypass metric. Instead we could use the predict_only flag to skip evaluations.

https://github.com/EleutherAI/lm-evaluation-harness/blob/42730d90b388931e336c3071f8b0bbe0fcb69493/lm_eval/evaluator.py#L366-L386

My understanding is that for this mode, we need to return resps in a different format than the output from --log_samples? In that case, everything after this snippet (aggregation, constructing the dict for printing tasks) could be skipped.

lintangsutawika avatar Dec 19 '23 07:12 lintangsutawika

This is a good impetus to rework the metrics+aggregation and process_results method anyway.

lintangsutawika avatar Dec 19 '23 07:12 lintangsutawika

I think that would be much neater than passing dummy metric calls, which is what I was considering. But we still need something in the configs for tasks like HumanEval, where evaluation metrics are unsupported and we only allow generating outputs?

baberabb avatar Dec 19 '23 08:12 baberabb

I think we can add a warning/check for when --predict_only is set but the config in use has a list of metrics, saying that they won't be computed or printed. I imagine there will be cases outside of code evaluations where a user also just wants the outputs (batch inference and analysis, for example).
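
A possible shape for that check, as a sketch (the attribute and parameter names here are assumptions for illustration, not the task config's actual fields):

```python
import logging
from typing import Iterable

eval_logger = logging.getLogger(__name__)


def warn_if_metrics_unused(
    task_name: str, metric_names: Iterable[str], predict_only: bool
) -> None:
    # If --predict_only is set but the task config still lists metrics,
    # tell the user those metrics will not be computed or printed.
    metric_names = list(metric_names)
    if predict_only and metric_names:
        eval_logger.warning(
            "--predict_only is set: metrics %s configured for task %s "
            "will not be computed or printed.",
            ", ".join(metric_names),
            task_name,
        )
```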

lintangsutawika avatar Dec 19 '23 09:12 lintangsutawika

@lintangsutawika there are a few reasons I like this bypass abstraction:

  • it allows us to run a code eval task and a task that does do scoring simultaneously, and just have the code task skip scoring while the other tasks score correctly (illustrated in the sketch below)
  • I like having a dummy results table printed, because it makes clear to the user that their run was successful and all tasks ran inference correctly, just without printing true scores.
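
Purely for illustration (the real table layout may differ), a mixed run might then report something like:

```python
# A bypassed code task alongside a normally scored task in the same run;
# the numbers and the table format are made up for illustration.
results = {
    "humaneval": {"pass@1": "N/A"},     # generations saved, execution bypassed
    "lambada_openai": {"acc": 0.7523},  # scored as usual
}
```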

My understanding is that for this mode, we need to return resps in a different format than the output from --log_samples?

I don’t think this is the case? We can just set up the offline eval scripts to receive the log_samples format

haileyschoelkopf avatar Dec 19 '23 11:12 haileyschoelkopf

We could think of a way that still prints the final table. But a metric that just passes results through feels a tad hackish and contrary to the idea of improving metrics+aggregation.

Also, log_samples actually already saves the model resps, which might already satisfy what's needed for offline eval, but it still doesn't address bypassing.

lintangsutawika avatar Dec 19 '23 12:12 lintangsutawika

With the aggregate metric, the hack-iness does decrease quite a bit IMO. Plus it would allow for tasks without targets, rather than needing to put in a dummy column 😂.

baberabb avatar Dec 20 '23 10:12 baberabb

Is it intended behaviour that, with the --predict_only argument, metrics are still calculated (which may take time for bootstrappable metrics, for example; that is how I found this, from the printed logs) by this code (I suppose): https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/evaluator.py#L495-L517? Or am I missing some setup in my custom-made task derived from the Task class?

Only model outputs will be saved and metrics will not be evaluated.

I read this description of the argument.

LSinev avatar Feb 20 '24 08:02 LSinev

Is it intended behaviour that, with the --predict_only argument, metrics are still calculated (which may take time for bootstrappable metrics, for example; that is how I found this, from the printed logs) by this code (I suppose): https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/evaluator.py#L495-L517? Or am I missing some setup in my custom-made task derived from the Task class?

Only model outputs will be saved and metrics will not be evaluated.

I read this description of the argument.

Yeah, it's a bit of a hack right now. We just send the responses through no-op dummy metric functions. We could add conditions to skip the bootstrap, I guess, but I'm not sure complicating the code is worth the time saved? Metric calculation is generally pretty quick compared to the task setup and the model responses.
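
For reference, the condition could be as simple as bailing out of the resampling loop when zero iterations are requested; a rough sketch, with illustrative names rather than the harness's internals:

```python
import random
from statistics import stdev


def maybe_bootstrap_stderr(metric_fn, samples, bootstrap_iters: int = 1000):
    # Skip the (potentially slow) resampling entirely when the caller asks
    # for zero iterations, e.g. under --predict_only.
    if bootstrap_iters <= 0:
        return None
    estimates = []
    for _ in range(bootstrap_iters):
        resample = random.choices(samples, k=len(samples))
        estimates.append(metric_fn(resample))
    return stdev(estimates)
```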

baberabb avatar Feb 20 '24 10:02 baberabb

conditions to skip the bootstrap

It seems like a good idea to have --bootstrap_iters in __main__.py, as it is hardcoded in simple_evaluate now. Then, for the moment, it could be used along with --predict_only (as --bootstrap_iters=0) to save time.

generally pretty quick

I was debugging a task with a --limit of 10, on CPU, with a T5ForConditionalGeneration model of ~220M parameters, and metric calculations with matthews_corrcoef seemed pretty slow to me. Bigger models on GPUs, run for >24 hours across multiple benchmarks, may still benefit from not calculating metrics when explicitly told to skip them, even if that only saves an hour.

If I may suggest: with the evaluator refactoring in https://github.com/EleutherAI/lm-evaluation-harness/pull/1441, could it be split into several functions/classes, so that getting requests from the model and calculating metrics happen separately (with the latter under a not args.predict_only condition)?
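
Something along these lines, as a self-contained sketch of the split (the callables stand in for the model and the metrics; none of these names are the evaluator's real functions):

```python
from typing import Callable, Dict, List, Optional


def collect_responses(
    generate: Callable[[List[str]], List[str]],
    requests: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    # Inference-only pass: run the model on every task's requests.
    return {task: generate(reqs) for task, reqs in requests.items()}


def evaluate(
    generate: Callable[[List[str]], List[str]],
    requests: Dict[str, List[str]],
    score: Callable[[str, List[str]], float],
    predict_only: bool = False,
) -> Dict[str, object]:
    samples = collect_responses(generate, requests)
    results: Optional[Dict[str, float]] = None
    if not predict_only:
        # Metric calculation only happens when scoring is requested.
        results = {task: score(task, resps) for task, resps in samples.items()}
    return {"samples": samples, "results": results}
```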

LSinev avatar Feb 20 '24 11:02 LSinev

It seems like a good idea to have --bootstrap_iters in __main__.py, as it is hardcoded in simple_evaluate now. Then, for the moment, it could be used along with --predict_only (as --bootstrap_iters=0) to save time.

If --predict_only is working properly, then it shouldn't use matthews_corrcoef or any other metric. The idea was to replace the metric with this: https://github.com/EleutherAI/lm-evaluation-harness/blob/8680e9386de5c4ad745a88b8726707a15f10cc65/lm_eval/api/metrics.py#L19-L21

Was that not the case? Will see if I can isolate the bug. I'll add a condition to skip bootstrapping just to be safe. Thanks!

If I may suggest, with evaluator refactoring https://github.com/EleutherAI/lm-evaluation-harness/pull/1441 may it be split to several functions/classes — getting requests from model, and calculating metrics separately (and this can be under not args.predict_only) condition?

I like this idea! But it probably belongs in another PR; #1441 is pretty bloated as it is.

baberabb avatar Feb 20 '24 12:02 baberabb

@LSinev fixed the bug in #1441. There was an indentation bug, and predict_only was only true for generation tasks.

baberabb avatar Feb 20 '24 12:02 baberabb