lm-evaluation-harness
Minor features
Features:
- disable `fewshot_as_multiturn` when `apply_chat_template` is not passed or `num_fewshots=0`. Why fail the run at all? For a zero-shot setup, multiturn == simple chat template, so there is no error at all. If the chat template is not enabled, throw a warning and disable multiturn (since it is not available without a chat template). A minimal sketch of this check follows the list.
- pass `predict_only` into the filters' `apply` method. Why? The filters are designed to be used even with additional ML models (a reward model, for example). If one runs lm-eval with `predict_only`, this may mean that the filter should not be applied. Now users may customize filters to use the `predict_only` info to manage filter behaviour (see the filter sketch after the list).
- add a `filter_device` param to the CLI. There was a TODO about it. If another LLM is used as a filter, it may need a device that DIFFERS from the one used to run the "main" LLM, e.g. llm-as-a-judge or LLMs that score the generations (also covered by the filter sketch below).
- disable `ensure_ascii` for the `apply_chat_template` method of the TemplateAPI class. Now Cyrillic symbols are stored in a valid, readable form (example below).
- add f1_macro and f1_micro metrics (aggregations, in fact) to the registry to handle multi-class classification tasks (sketch below).
- new param in `model_args` for APIs: `timeout`. When running a vLLM server and using lm-eval in OpenAI API mode to make requests to this server, the timeout may need to be increased (e.g. to run Llama-3.1-405B; for me there were lots of connection errors, which were solved by increasing the timeout param). A usage sketch follows below.
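
A minimal sketch of the `fewshot_as_multiturn` check described in the first item, assuming a helper of this shape (the function name and the warning text are illustrative, not the actual code in the harness):

```python
import warnings

def resolve_fewshot_as_multiturn(
    fewshot_as_multiturn: bool,
    apply_chat_template: bool,
    num_fewshot: int,
) -> bool:
    """Decide whether fewshot_as_multiturn should stay enabled.

    Instead of failing the run, warn and fall back to the plain setup
    whenever the flag cannot have any effect.
    """
    if not fewshot_as_multiturn:
        return False
    if num_fewshot == 0:
        # Zero-shot: a multiturn prompt degenerates to a single chat turn,
        # i.e. it equals the simple chat template, so just drop the flag.
        return False
    if not apply_chat_template:
        # Multiturn few-shot is only defined on top of a chat template.
        warnings.warn(
            "fewshot_as_multiturn has no effect without apply_chat_template; disabling it."
        )
        return False
    return True
```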
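
The two filter-related changes can be illustrated together. The sketch below is a hypothetical custom filter, not the harness's actual `Filter` base class: how `predict_only` reaches `apply`, the judge model, and the way the `filter_device` value reaches the constructor are all assumptions for illustration.

```python
from transformers import pipeline

class JudgeScoreFilter:
    """Hypothetical filter that scores generations with a second LLM.

    `device` is meant to come from the new `filter_device` CLI param, so the
    judge can sit on a different GPU than the model being evaluated.
    """

    def __init__(self, judge_model: str, device: str = "cuda:1"):
        # e.g. judge_model = "OpenAssistant/reward-model-deberta-v3-large-v2"
        self.scorer = pipeline("text-classification", model=judge_model, device=device)

    def apply(self, resps, docs, predict_only: bool = False):
        # With --predict_only only raw generations are wanted, so the
        # (potentially expensive) judge pass is skipped entirely.
        if predict_only:
            return resps
        return [
            [(resp, self.scorer(resp)[0]["score"]) for resp in resp_set]
            for resp_set in resps
        ]
```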
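
The `ensure_ascii` point is standard `json.dumps` behaviour: with the default `ensure_ascii=True`, Cyrillic text is escaped into `\uXXXX` sequences, while disabling it keeps the characters readable.

```python
import json

msg = {"role": "user", "content": "Привет, мир!"}

json.dumps(msg)                      # '{"role": "user", "content": "\u041f\u0440\u0438..."}'
json.dumps(msg, ensure_ascii=False)  # '{"role": "user", "content": "Привет, мир!"}'
```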
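
The macro/micro F1 aggregations presumably reduce a list of (gold, prediction) pairs with `sklearn.metrics.f1_score`; a self-contained sketch (in the harness these would be wired through the aggregation registry, which is omitted here):

```python
from sklearn.metrics import f1_score

def f1_macro(items):
    """Per-class F1 scores averaged without class weighting."""
    golds, preds = zip(*items)
    return f1_score(golds, preds, average="macro")

def f1_micro(items):
    """F1 computed from global TP/FP/FN counts across all classes."""
    golds, preds = zip(*items)
    return f1_score(golds, preds, average="micro")

# A tiny 3-class example: each item is a (gold, prediction) pair.
pairs = [(0, 0), (1, 2), (2, 2), (1, 1), (0, 2)]
print(f1_macro(pairs), f1_micro(pairs))
```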
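
On the client side, the `timeout` param boils down to forwarding a larger value into the HTTP request, roughly as below; driving it from the CLI would look something like `--model_args base_url=http://localhost:8000/v1/completions,timeout=600`, where the value and the other keys are illustrative.

```python
import requests

def post_with_timeout(url: str, payload: dict, timeout: float = 600.0) -> dict:
    # A 405B model behind a vLLM server can take minutes per batch, so the
    # read timeout has to be raised well above typical client defaults.
    response = requests.post(url, json=payload, timeout=timeout)
    response.raise_for_status()
    return response.json()
```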