Baber Abbasi
I thought HF did a great job in [explaining](https://huggingface.co/docs/transformers/chat_templating) how their chat templating works.
> * How should we allow users to specify a system prompt, if at all? IMO we should probably allow users to pass system prompts. The Llama-2 paper provides...
> Btw, the user/assistant turns are something we added in lighteval [here](https://github.com/huggingface/lighteval/blob/main/src/lighteval/few_shot_manager.py#L159), so if you want to reuse this snippet in the harness, feel free to do so :) (Finished...
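For context, a minimal sketch of what this looks like with `transformers` (the model name and messages here are purely illustrative): a system prompt plus alternating user/assistant few-shot turns are passed to `apply_chat_template`, which renders them into whatever prompt format the model's template expects.

```python
from transformers import AutoTokenizer

# Illustrative model choice; any chat model that ships a chat template works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Few-shot examples rendered as alternating user/assistant turns,
# with an optional system prompt at the start.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Q: 2 + 2 = ?"},       # few-shot example
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "Q: 3 + 5 = ?"},       # the actual query
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # return the formatted string rather than token ids
    add_generation_prompt=True,  # append the assistant header so the model continues from it
)
print(prompt)
```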
Hi! vLLM does have a [drop-in replacement](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#using-openai-completions-api-with-vllm:~:text=8000/v1/models-,Using%20OpenAI%20Completions%20API%20with%20vLLM,-%23) for the OpenAI API server, which should work with minimal [modifications](https://github.com/EleutherAI/lm-evaluation-harness/blob/f0b9649146b30c966bf7fa159e3ad45133acc672/lm_eval/models/openai_completions.py#L58) by passing in the `base_url` and correcting the tokenizer to match...
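As a rough sketch (assuming a vLLM OpenAI-compatible server is already running locally on the default port, and using the v1-style `openai` client rather than whatever version the harness pinned at the time), pointing a client at the server is just a matter of overriding `base_url`:

```python
from openai import OpenAI

# Local vLLM endpoint; the api_key is unused by vLLM but required by the client.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # must match the model the vLLM server was launched with
    prompt="San Francisco is a",
    max_tokens=16,
    logprobs=1,
)
print(completion.choices[0].text)
```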
I'm not getting this on my end. Maybe try clearing the cached dataset and running again?
Would a `metric: bypass` be a good abstraction, do you think?
I think that would be much neater than passing dummy metric calls, which is what I was considering, but we still need something in the configs for tasks like HumanEval, where evaluation...
With the aggregate metric, the hackiness does decrease quite a bit IMO. Plus, it would allow for tasks without targets rather than needing to put in a dummy column 😂.
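To make the idea concrete, here is a rough sketch of what a bypass metric could amount to; the names (`bypass`, `bypass_agg`) are illustrative, not the harness's actual implementation. It simply passes predictions through unscored, so generation-only tasks (e.g. HumanEval, where execution-based scoring happens outside the harness) can still have a metric entry in their config without a dummy target column.

```python
def bypass(predictions, references=None):
    # Nothing to score: return the raw model outputs untouched.
    return predictions


def bypass_agg(items):
    # Aggregation is a no-op as well; report a placeholder instead of a number.
    return "N/A"
```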
> Is it intended behaviour that with the `--predict_only` argument, metrics are still calculated (which may take time for bootstrappable metrics, for example; that is how I found this...
> Seems a good idea to have `--bootstrap_iters` in `__main__.py` as it is hardcoded in `simple_evaluate` now. Then, at the moment, it can be used along with `--predict_only`: `--bootstrap_iters=0` to...
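For context on why this matters, here is a self-contained sketch (not the harness's code) of the bootstrap resampling that makes stderr estimation slow, and how a `bootstrap_iters` of 0 would skip it entirely:

```python
import random
from statistics import mean, stdev


def bootstrap_stderr(values, iters):
    # Illustrative only: estimate the standard error of the mean metric value by
    # resampling the per-example scores `iters` times. The resampling loop is the
    # expensive part that --predict_only users may want to skip with --bootstrap_iters=0.
    if iters <= 1:
        return None  # 0 (or 1) iterations: skip bootstrapping, report no stderr
    resampled_means = [
        mean(random.choices(values, k=len(values))) for _ in range(iters)
    ]
    return stdev(resampled_means)


# Example: per-example accuracy scores from a small eval run.
print(bootstrap_stderr([0.0, 1.0, 1.0, 0.0, 1.0], iters=1000))
```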