
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Results: 428 evals issues

### Describe the feature or improvement you're requesting Adding this conversational UI would enable people to 'talk' directly with the backend and would allow API requests to be carried out more effectively....

### Describe the bug There is a default 40s timeout for completion functions, set by EVALS_THREAD_TIMEOUT. In eval.py, when evaluating a sample times out, it is retried. However, it appears...

bug
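
For anyone hitting this timeout before a fix lands, a possible workaround is to raise the limit via the environment. A minimal sketch, assuming `EVALS_THREAD_TIMEOUT` is read from the environment at startup and using a hypothetical model/eval pair for illustration:

```shell
# Raise the per-sample completion timeout (default 40s per the report above)
# for a single run; the model and eval names here are placeholders.
EVALS_THREAD_TIMEOUT=120 oaieval gpt-3.5-turbo test-match
```

This is a config fragment, not a fix for the retry behavior described in the report.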

# Thank you for contributing an eval! ♥️ 🚨 Please make sure your PR follows these guidelines; failure to follow the guidelines below will result in the PR being closed...

### Describe the feature or improvement you're requesting In many production scenarios, it is important to do cost-benefit analysis, and it would be great if the `oaieval` command could also return...

### Describe the feature or improvement you're requesting I already have the output (generated from the LLM) and ideal_answers in my jsonl file. For example: ``` {'input': 'what is...

### Add support for function call. I would like to `eval` based on prompts that utilize `function_call`. From what I have seen in the code, it's not possible at the moment....

### Describe the feature or improvement you're requesting There have been numerous improvements and fixes to the evals framework itself over the past few months, but these haven't been released...

### Describe the feature or improvement you're requesting It would be helpful for the next iteration of the generative pre-trained model to learn how to identify any and all...

### Generic question about the accuracy score and boostrap_std metric When I run an eval, I get the following report. `{'accuracy': 0.6, 'boostrap_std': 0.1423900220076777}` How do I decide whether the accuracy is...
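
One common way to read a reported standard deviation like this is as the spread of the accuracy estimate, which gives a rough confidence interval under a normal approximation. A minimal sketch using the numbers quoted above (the framework's exact bootstrap resampling procedure is not shown here):

```python
# Rough 95% confidence interval for the reported accuracy, using a
# normal approximation on the bootstrap standard deviation.
accuracy = 0.6
bootstrap_std = 0.1423900220076777

low = accuracy - 1.96 * bootstrap_std
high = accuracy + 1.96 * bootstrap_std
print(f"accuracy = {accuracy:.2f}, 95% CI roughly ({low:.2f}, {high:.2f})")
```

An interval this wide usually means too few samples were evaluated to distinguish the score from chance; running the eval on more samples narrows it.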

### Describe the feature or improvement you're requesting Hi. I was wondering if there are any websites where I can share and see others' evaluation results. Should I run every...