Baber Abbasi
For repeat > 1, model outputs (`resps`) for each model call are saved to file when using `--log_samples`, but only a single `filtered_resp` is saved (the first one?).
Closes #1160. You might have to change some security settings to give write permission to the workflow. I have set it up to run when it detects any tasks were...
Added the AgiEval benchmark as provided in the [Llama-2 paper](https://arxiv.org/pdf/2307.09288.pdf) (Appendix A2) (8 English multiple-choice datasets from [here](https://github.com/microsoft/AGIEval)). I tried to reproduce the results from the paper and although the...
Started on #1152. Couple of issues:
- [x] Cannot amend `metric_list` in evaluate
- [x] print results crashing
- [x] repeated generations are not saved in `log_samples`
`main` raises a ValueError if `fewshot_as_multiturn` and no `num_fewshot`. However, `num_fewshot` can also be set in the task YAML, which is processed later. Think we should remove this especially because...
Right now the [tests](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/tests/models/test_api.py) for the API models mostly check if the payload is in the right format. Would be nice if we could mock an API and also test...
When all samples are already cached, the process errors out on the subsequent run (instead of skipping to the metric calculation) due to lack of requests to pass on...
Looking at the samples generated from `gsm8k`, it seems like there is now a separate entry for each filter, except there is no indication of _which_ filter each entry...
Hi! Is it possible to cut a new version to PyPI? The current one installs all the optional dependencies and some of them have specific build requirements (e.g. `LTpycld2` requires...
The conditions in HFLM now check for either `causal` or `seq2seq` rather than checking for the `AUTO_MODEL_CLASS`