Baber Abbasi issues

Results: 21 issues of Baber Abbasi

Only a single `filtered_resps` is logged for repeat > 1 for each sample

For repeat > 1, the model outputs (`resps`) for each model call are saved to file when using `--log_samples`, but only a single `filtered_resp` (the first one?) is saved.

Add task table

Closes #1160. You might have to change some security settings to give write permission to the workflow. I have set it up to run when it detects any tasks were...

Added the AgiEval benchmark as provided in the [Llama-2 paper](https://arxiv.org/pdf/2307.09288.pdf) (Appendix A2) (8 English multiple-choice datasets from [here](https://github.com/microsoft/AGIEval)). I tried to reproduce the results from the paper and although the...

feature request

add bypass metric

Started on #1152. Couple of issues:

- [x] Cannot amend `metric_list` in evaluate
- [x] print results crashing
- [x] repeated generations are not saved in `log_samples`

Premature `num_fewshot` check with `fewshot_as_multiturn`

`main` raises a `ValueError` if `fewshot_as_multiturn` is set without `num_fewshot`. However, `num_fewshot` can also be set in the task YAML, which is processed later. Think we should remove this especially because...
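A minimal sketch of what a deferred check could look like, run only once the task configuration is available. The helper `resolve_num_fewshot`, the dict-shaped task config, and the rule that a command-line value takes precedence over the YAML value are all assumptions for illustration, not the harness's actual code.

```python
from typing import Optional


def resolve_num_fewshot(cli_num_fewshot: Optional[int],
                        task_config: dict,
                        fewshot_as_multiturn: bool) -> int:
    """Pick the effective num_fewshot, then validate (hypothetical helper)."""
    # Assumption: a command-line value wins; otherwise use the task YAML value.
    if cli_num_fewshot is not None:
        effective = cli_num_fewshot
    else:
        effective = task_config.get("num_fewshot") or 0
    # Validate only after both sources have been consulted.
    if fewshot_as_multiturn and effective == 0:
        raise ValueError(
            "fewshot_as_multiturn requires num_fewshot > 0 "
            "(set it on the command line or in the task YAML)"
        )
    return effective
```

With this ordering, a task that sets `num_fewshot` only in its YAML no longer trips the check before the YAML has been read.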

Better tests for API models

Right now the [tests](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/tests/models/test_api.py) for the API models mostly check if the payload is in the right format. Would be nice if we could mock an API and also test...
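One way such a mock could be sketched with only the standard library is to patch `urllib.request.urlopen` and assert on both the outgoing payload and the parsed response. The client `query_completion`, the URL, and the response shape below are hypothetical stand-ins, not the harness's API model classes.

```python
import json
from unittest import mock
from urllib import request


def query_completion(url: str, prompt: str) -> str:
    """Hypothetical client: POST a prompt, return the first completion text."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    req = request.Request(url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["text"]


def fake_urlopen_returning(body: dict) -> mock.MagicMock:
    """Build a stand-in for urlopen whose response yields `body` as JSON."""
    response = mock.MagicMock()
    response.read.return_value = json.dumps(body).encode("utf-8")
    # urlopen is used as a context manager, so __enter__ returns the response.
    response.__enter__.return_value = response
    return mock.MagicMock(return_value=response)


def run_mocked(prompt: str):
    """Call the client against the fake API; return (text, payload sent)."""
    fake = fake_urlopen_returning({"choices": [{"text": "42"}]})
    with mock.patch.object(request, "urlopen", fake):
        text = query_completion("http://localhost/v1/completions", prompt)
    sent_request = fake.call_args[0][0]  # the Request passed to urlopen
    return text, json.loads(sent_request.data)
```

Returning the payload alongside the parsed text lets a single test cover both the request format and the response handling.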

Evaluation fails when all samples are cached

When all samples are already cached, the process errors out on the subsequent run (instead of skipping to the metric calculation) due to lack of requests to pass on...
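A minimal sketch of the kind of guard that avoids this, assuming a simple prompt-to-response cache. The function `run_with_cache` and its arguments are hypothetical and do not mirror the harness's caching layer.

```python
from typing import Callable, Dict, List


def run_with_cache(prompts: List[str],
                   cache: Dict[str, str],
                   model_fn: Callable[[List[str]], List[str]]) -> List[str]:
    """Return one response per prompt, calling the model only for cache misses."""
    missing = [p for p in prompts if p not in cache]
    # Guard: with nothing left to request, skip the model call entirely.
    if missing:
        for prompt, resp in zip(missing, model_fn(missing)):
            cache[prompt] = resp
    return [cache[p] for p in prompts]
```

When every prompt is cached, `model_fn` is never invoked and the cached responses flow straight on to whatever comes next, such as metric calculation.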

Duplicate `sample` entries

Looking at the samples generated from `gsm8k`, it seems like there is now a separate entry for each filter, except there is no indication of _which_ filter each entry...
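As an illustration only, each per-filter entry could carry the name of the filter that produced it. The helper `build_sample_entries` and the `filter` key are assumptions for this sketch, not the harness's logged schema.

```python
from typing import Dict, List


def build_sample_entries(doc_id: int,
                         resps: List[str],
                         filtered: Dict[str, List[str]]) -> List[dict]:
    """One log entry per filter, each labelled with the filter's name."""
    return [
        {
            "doc_id": doc_id,
            "filter": name,  # hypothetical key identifying the filter
            "resps": resps,
            "filtered_resps": outs,
        }
        for name, outs in filtered.items()
    ]
```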

PyPI release

Hi! Is it possible to cut a new release on PyPI? The current one installs all the optional dependencies and some of them have specific build requirements (e.g. `LTpycld2` requires...

HF: switch conditional checks to `self.backend` from `AUTO_MODEL_CLASS`

The conditions in HFLM now check for either `causal` or `seq2seq` rather than checking for the `AUTO_MODEL_CLASS`