Adding tinyBenchmarks datasets
This PR adds tinyBenchmarks (paper) to the eval harness.
tinyBenchmarks already uses the eval harness for evaluation (see the dataset cards, e.g. tinyHellaswag), but currently requires users to create their own configs.
Thanks for the PR! We're glad that lm-eval-harness was useful to you during development :))
A few things:
- Would you be able to sign the CLA and run the precommit for formatting?
I'm curious how long it takes to fit the IRT model. It would be nice to spare people from having to run the tinybenchmarks code snippet separately and instead directly report the estimated full scores; although if you feel that this is more annoying than just following your readme instructions, then this can be fine as is.
The IRT model is already fitted, so what we run here is a very quick inference step. We were thinking about this as well, but were not entirely sure how to best implement it. Do you have a recommendation on how to integrate it? I assume we would best use a custom metric defined in a script included in the task folder, but how would the dependency (the tinyBenchmarks package) be handled? Thanks a lot for your help!
IFEval would be a good task to model this off of: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/ifeval
Basically, we'd add a `lm_eval[tinybenchmarks]` extra which would install your package, and import that module in `utils.py` for this task. And yep, you could use a custom metric, which is a function defined in `utils` for your task!
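To make the idea concrete, here is a minimal sketch of what such a `utils.py` could look like. The guarded import and the function name `agg_irt` are assumptions on my part; the `tb.evaluate` call pattern mirrors the snippet shown later in this thread:

```python
# Sketch of a utils.py for the tinybenchmarks task (names hypothetical).
# The guarded import lets the harness load the task config even when the
# optional lm_eval[tinybenchmarks] extra is not installed.
from typing import List

try:
    import tinybenchmarks as tb
except ImportError:
    tb = None


def agg_irt(items: List[float], benchmark: str = "tinyHellaswag") -> float:
    """IRT-based estimate of the full-benchmark score from per-item accuracies.

    Falls back to the plain mean when the optional package is missing,
    so the task still runs (with a weaker estimate).
    """
    if tb is None:
        return sum(items) / len(items)
    # Call pattern taken from the tinybenchmarks snippet in this thread.
    return tb.evaluate(items, benchmark)[benchmark]["gpirt"]
```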
I looked a bit into this. It seems that it is easy to modify the existing metrics `acc` and `acc_norm` (e.g. by adding a custom aggregation function), but it is not so easy to add an additional metric. For example, if I add a new metric like so:
```yaml
...
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
  - metric: acc_norm
    aggregation: mean
    higher_is_better: true
  - metric: !function metrics.acc_irt
    aggregation: !function metrics.agg_irt
    higher_is_better: true
```
it won't show up in the outputs. To add a custom metric, I would have to write my own `process_results` function, which looks a bit intricate. Is that correct?
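For what it's worth, a minimal sketch of such a `process_results` for a multiple-choice task might look like the following. The field names (`gold`, `choices`) and the shape of `results` are assumptions here and should be checked against an existing task; the point is that the extra dict key is what surfaces a third metric in the outputs:

```python
import numpy as np


def process_results(doc, results):
    # results: one (log-likelihood, is_greedy) pair per answer choice
    # (shape assumed; verify against an existing multiple-choice task).
    lls = [ll for ll, _ in results]
    gold = doc["gold"]  # index of the correct choice (field name assumed)
    acc = 1.0 if int(np.argmax(lls)) == gold else 0.0
    # Length-normalized variant, in the spirit of acc_norm.
    lens = np.array([len(c) for c in doc["choices"]])
    acc_norm = 1.0 if int(np.argmax(np.array(lls) / lens)) == gold else 0.0
    # The extra key is what makes a third metric appear in the outputs;
    # its aggregation (e.g. agg_irt) is picked up from the YAML config.
    return {"acc": acc, "acc_norm": acc_norm, "acc_irt": acc}
```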
@LucWeber what would you need for the acc_irt metric? In addition to the IRT model, would the inputs be the same as acc?
Hey @lintangsutawika, thanks for coming back to this PR. I think the inputs for the metric would be the same, but the aggregation function would require information about the specific benchmark that is evaluated. Something like this:
```python
from typing import List

import numpy as np
import tinybenchmarks as tb


def agg_gpirt(items: List[float], benchmark: str) -> float:
    items = np.array(items)
    predictions = tb.evaluate(items, benchmark)
    return predictions[benchmark]["gpirt"]
```
In that case, you could just substitute the aggregation in `acc` with your `agg_irt`. But if you also want to show `acc`, then that becomes a bit tricky.
```yaml
metric_list:
  - metric: acc
    aggregation: !function metrics.agg_irt
    higher_is_better: true
```
I just pushed changes that add a custom aggregation function and use the `acc_norm` metric. I now output only the most important metric from the paper, with an explanation of how to obtain the others.
Hey,
is there something I can do at this stage to move the PR forward?
Hi @LucWeber ! It should be good to go, I adjusted things (some task names were typos / conflicted with existing ones) but had wanted to let tests pass before merging. The failures are likely unrelated and I've since fixed what was causing them.
If you merge main into this branch they should pass and we can merge!
Hey @haileyschoelkopf ,
great, that's done. Thanks so much. Let me know if there is something else.