
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

428 evals issues, sorted by recently updated

### Discussed in https://github.com/openai/evals/discussions/621 Originally posted by **55255ru** April 10, 2023 Hello. I suggest using what is written on my website (which is here http://www.55255.ru) to improve GPT-4 because I...

It could be interesting to explore whether we could use [MusPy](https://salu133445.github.io/muspy/) to add some text/symbolic music evals. /cc @salu133445
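A minimal sketch of what data preparation for such an eval could look like, assuming MusPy's `read_midi` and `to_note_representation` helpers; the file paths, prompt/ideal framing, and sample split are placeholders, not a proposed design:

```python
import json
import muspy  # pip install muspy

# Illustrative only: turn a MIDI file into a text representation of note events
# that could serve as an eval sample. Paths and prompt wording are hypothetical.
music = muspy.read_midi("example.mid")        # parse MIDI into a muspy.Music object
notes = muspy.to_note_representation(music)   # per-note event array (onset time, pitch, duration, ...)

sample = {
    "input": [{
        "role": "user",
        "content": "Continue this note-event sequence: " + json.dumps(notes[:16].tolist()),
    }],
    "ideal": json.dumps(notes[16:32].tolist()),
}

with open("samples.jsonl", "w") as f:
    f.write(json.dumps(sample) + "\n")
```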

Idea for Eval

Hi, it would be cool to evaluate all OpenAI models on the Beyond the Imitation Game Benchmark (BIG-bench), which is a collaborative benchmark intended to probe large language models and extrapolate their...
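If a BIG-bench task were ported over, registering it would presumably follow the same pattern as the existing registry YAML files; the eval name, samples path, and use of the basic `Match` class below are assumptions for illustration only:

```yaml
# Hypothetical registry entry (e.g. evals/registry/evals/bigbench-example.yaml);
# the task name and samples path are placeholders.
bigbench-example:
  id: bigbench-example.dev.v0
  description: Example BIG-bench task ported to the evals samples format.
  metrics: [accuracy]
bigbench-example.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: bigbench_example/samples.jsonl
```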

Idea for Eval

Hello everyone, thank you for your contributions so far. I've been working through them, and these tasks are forming a challenging and comprehensive benchmark for modern LLMs and LLM programs. We...

Idea for Eval

Noticed a few actions used in the workflows here are outdated; proposing a Dependabot configuration to keep them updated - reference https://docs.github.com/en/actions/security-guides/using-githubs-security-features-to-secure-your-use-of-github-actions#keeping-the-actions-in-your-workflows-secure-and-up-to-date Current workflow executions have a deprecation notice, e.g. https://github.com/openai/evals/actions/runs/8903656117 >...
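A sketch of such a configuration, assuming the standard `.github/dependabot.yml` location; the weekly interval is an arbitrary choice:

```yaml
# .github/dependabot.yml - keep GitHub Actions versions in workflows up to date
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
```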

### Describe the feature or improvement you're requesting Please add support for GPT-4o for evaluation. ### Additional context _No response_
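Once the model name is recognized, running an existing eval against it should just be a matter of passing it to `oaieval`; `test-match` below is only an example eval, and whether `gpt-4o` is accepted depends on the installed evals version:

```bash
# Illustrative: run the basic test-match eval against GPT-4o
oaieval gpt-4o test-match
```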

# Thank you for contributing an eval! ♥️ 🚨 Please make sure your PR follows these guidelines; **failure to follow the guidelines below will result in the PR being closed...