
Adding tinyBenchmarks datasets

LucWeber opened this pull request 2 years ago · 9 comments

This PR adds tinyBenchmarks (paper) to the eval harness.

tinyBenchmarks already uses the eval harness for evaluation (see the dataset cards, e.g. tinyHellaswag), but currently requires users to create their own configs.

LucWeber avatar Mar 08 '24 18:03 LucWeber

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Mar 08 '24 18:03 CLAassistant

Thanks for the PR! We're glad that lm-eval-harness was useful to you during development :))

A few things:

  • Would you be able to sign the CLA and run pre-commit for formatting?

I'm curious how long it takes to fit the IRT model. It would be nice to report the estimated full scores directly so people don't have to run the tinybenchmarks code snippet separately, though if that feels more annoying than just following your README instructions, this is fine as is.

haileyschoelkopf avatar Mar 08 '24 20:03 haileyschoelkopf

> Thanks for the PR! We're glad that lm-eval-harness was useful to you during development :))
>
> A few things:
>
> * Would you be able to sign the CLA and run pre-commit for formatting?
>
> I'm curious how long it takes to fit the IRT model. It would be nice to report the estimated full scores directly so people don't have to run the tinybenchmarks code snippet separately, though if that feels more annoying than just following your README instructions, this is fine as is.

The IRT model is already fitted, so what runs here is a very quick inference step. We were thinking about this as well, but were not entirely sure how best to implement it. Do you have a recommendation for integrating it? I assume the best approach is a custom metric defined in a script that we include in the task folder, but how would the dependency (the tinyBenchmarks package) be handled? Thanks a lot for your help!

LucWeber avatar Mar 11 '24 10:03 LucWeber

IFEval would be a good metric to model off of! https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/ifeval

Basically, we'd add an `lm_eval[tinybenchmarks]` extra that installs your package, and import that module in `utils.py` for this task. And yep, you could use a custom metric defined as a function in `utils.py` for your task!
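As a rough sketch of what that could look like (the function names, the benchmark key, and the lazy-import guard are assumptions for illustration, not code from this PR):

```python
# Hypothetical utils.py for a tinyBenchmarks task, modeled on the IFEval
# pattern of shipping metric/aggregation functions inside the task folder.

def acc_irt(item):
    # Per-item metric: a passthrough of the 0/1 correctness score, so the
    # aggregation step sees the full per-document response vector.
    return item

def agg_irt(items):
    # Aggregation: hand the collected scores to the pre-fitted IRT model.
    # tinybenchmarks is imported lazily so the harness still loads when the
    # optional lm_eval[tinybenchmarks] extra is not installed.
    try:
        import numpy as np
        import tinybenchmarks as tb
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError(
            "please install the optional dependency: pip install lm_eval[tinybenchmarks]"
        ) from e
    predictions = tb.evaluate(np.array(items), "hellaswag")  # benchmark name hypothetical
    return predictions["hellaswag"]["gpirt"]
```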

haileyschoelkopf avatar Mar 12 '24 19:03 haileyschoelkopf

I looked into this a bit. It seems easy to modify the existing metrics `acc` and `acc_norm` (e.g. by adding a custom aggregation function), but not so easy to add an additional metric. For example, if I add a new metric like so:

```yaml
# ...
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: !function metrics.acc_irt
    aggregation: !function metrics.agg_irt
    higher_is_better: true
```

it won't show up in the outputs. To add a custom metric, I would have to write my own `process_results` function, which looks a bit intricate. Is that correct?
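For what it's worth, a custom `process_results` is just a per-document function returning a dict of metric name → value; a minimal sketch (the field names `doc["gold"]` and `doc["choices"]` are illustrative, not the harness's exact schema):

```python
# Hedged sketch of a process_results function for a multiple-choice task.
# The harness calls it once per document with that document's model outputs,
# and each returned key becomes a reported metric.

import numpy as np

def process_results(doc, results):
    lls, is_greedy = zip(*results)  # per-choice (loglikelihood, is_greedy) pairs
    gold = doc["gold"]
    # Standard accuracy: did the highest-loglikelihood choice match the gold?
    acc = 1.0 if int(np.argmax(lls)) == gold else 0.0
    # Length-normalized accuracy, as used by acc_norm.
    completion_len = np.array([float(len(c)) for c in doc["choices"]])
    acc_norm = 1.0 if int(np.argmax(lls / completion_len)) == gold else 0.0
    return {
        "acc": acc,
        "acc_norm": acc_norm,
        "acc_irt": acc,  # same per-item input; the IRT step happens at aggregation
    }
```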

LucWeber avatar Mar 17 '24 16:03 LucWeber

@LucWeber what would you need for the acc_irt metric? In addition to the IRT model, would the inputs be the same as acc?

lintangsutawika avatar Apr 26 '24 08:04 lintangsutawika

> @LucWeber what would you need for the acc_irt metric? In addition to the IRT model, would the inputs be the same as acc?

Hey @lintangsutawika, thanks for coming back to this PR. I think the inputs for the metric would be the same, but the aggregation function would require information about the specific benchmark that is evaluated. Something like this:

```python
from typing import List

import numpy as np
import tinybenchmarks as tb

def agg_gpirt(items: List[float], benchmark: str) -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]
```
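Since the harness calls an aggregation function with only the list of items, the `benchmark` argument would need to be bound in advance; one way is `functools.partial` (a sketch; the per-benchmark names are hypothetical, not from this PR):

```python
from functools import partial
from typing import List

def agg_gpirt(items: List[float], benchmark: str) -> float:
    # Imported lazily so the task module loads even when the optional
    # tinybenchmarks dependency is not installed.
    import numpy as np
    import tinybenchmarks as tb
    predictions = tb.evaluate(np.array(items), benchmark)
    return predictions[benchmark]["gpirt"]

# One fixed-benchmark aggregation per task, which the task YAML could then
# reference as e.g. `aggregation: !function metrics.agg_gpirt_hellaswag`.
agg_gpirt_hellaswag = partial(agg_gpirt, benchmark="hellaswag")
```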

LucWeber avatar Apr 26 '24 10:04 LucWeber

In that case, you could just substitute the aggregation in `acc` with your `agg_irt`. But if you want to also show `acc`, then that becomes a bit tricky.

```yaml
metric_list:
  - metric: acc
    aggregation: !function metrics.agg_irt
    higher_is_better: true
```
lintangsutawika avatar Apr 26 '24 13:04 lintangsutawika

> In that case, you could just substitute the aggregation in `acc` with your `agg_irt`. But if you want to also show `acc`, then that becomes a bit tricky.
>
> ```yaml
> metric_list:
>   - metric: acc
>     aggregation: !function metrics.agg_irt
>     higher_is_better: true
> ```

I just pushed changes that add a custom aggregation function and use the `acc_norm` metric. For now, I output only the most important metric from the paper, with an explanation of how to obtain the others.

LucWeber avatar Apr 27 '24 14:04 LucWeber

Hey,

is there something I can do at this stage to move the PR forward?

LucWeber avatar May 10 '24 14:05 LucWeber

Hi @LucWeber ! It should be good to go. I adjusted a few things (some task names had typos / conflicted with existing ones) but wanted to let tests pass before merging. The failures are likely unrelated, and I've since fixed what was causing them.

If you merge main into this branch they should pass and we can merge!

haileyschoelkopf avatar May 10 '24 19:05 haileyschoelkopf

Hey @haileyschoelkopf ,

great, that's done. Thanks so much. Let me know if there is something else.

LucWeber avatar May 13 '24 08:05 LucWeber