evals
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
### Describe the feature or improvement you're requesting Adding this conversational UI would enable people to 'talk' directly with the backend and allow API requests to be carried out more effectively....
### Describe the bug There is a default 40s timeout for completion functions, as set by EVALS_THREAD_TIMEOUT. In eval.py, when evaluating a sample times out, it is retried. However, it appears...
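As a possible workaround while the retry behaviour is discussed, the timeout referenced above can be raised via the `EVALS_THREAD_TIMEOUT` environment variable before invoking the eval. A minimal sketch (the 600s value is arbitrary; the `oaieval` arguments shown are placeholders):

```shell
# Raise the per-sample completion timeout from its 40s default so that
# slow completions can finish instead of timing out and being retried.
export EVALS_THREAD_TIMEOUT=600
# Then run the eval as usual, e.g.:
# oaieval gpt-4 test-match
```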
# Thank you for contributing an eval! ♥️ 🚨 Please make sure your PR follows these guidelines: __failure to follow the guidelines below will result in the PR being closed...
### Describe the feature or improvement you're requesting In many production scenarios, it is important to do cost-benefit analysis, and it would be great if the `oaieval` command could also return...
### Describe the feature or improvement you're requesting I already have the output (generated from an LLM) and ideal_answers in my jsonl file. For a look: ``` {'input': 'what is...
### Add support for function call. I would like to `eval` based on prompts that utilize `function_call`. From what I have seen in the code, it's not possible at the moment....
### Describe the feature or improvement you're requesting There have been numerous improvements and fixes to the evals framework itself over the past few months, but these haven't been released...
### Describe the feature or improvement you're requesting It would be helpful for the next iteration of the generative pre-trained model to learn how to identify any and all...
### Generic question about the accuracy score and boostrap_std metric When I run an eval, I get the following report. `{'accuracy': 0.6, 'boostrap_std': 0.1423900220076777}` How do I decide whether the accuracy is...
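The exact computation inside the framework may differ, but the reported `boostrap_std` can be understood as a bootstrap estimate of the standard deviation of the mean accuracy: resample the per-sample pass/fail outcomes with replacement many times and measure how much the resampled accuracy varies. A minimal sketch (function name and resample count are illustrative):

```python
import random

def bootstrap_std(outcomes, n_resamples=1000, seed=0):
    """Estimate the std. deviation of the mean accuracy by resampling
    per-sample outcomes (1 = correct, 0 = incorrect) with replacement."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Draw n samples with replacement and record the resampled accuracy.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    mean_of_means = sum(means) / n_resamples
    variance = sum((m - mean_of_means) ** 2 for m in means) / n_resamples
    return variance ** 0.5

# e.g. 6 correct out of 10 samples gives accuracy 0.6 with a bootstrap std
# roughly on the order of the value quoted in the report above.
print(bootstrap_std([1, 1, 1, 1, 1, 1, 0, 0, 0, 0]))
```

Intuitively, an accuracy difference between two runs that is large relative to this spread (a common rule of thumb is about two standard deviations) is unlikely to be resampling noise; the large value here mostly reflects the small number of samples.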
### Describe the feature or improvement you're requesting Hi. I was wondering if there are any websites where I can share and see others' evaluation results. Should I run every...