
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Results: 428 evals issues, sorted by recently updated.

- **bug**: `oaieval` frequently hangs near the end, before reporting. To reproduce: run `EVALS_THREADS=12 EVALS_THREAD_TIMEOUT=10 oaieval gpt-3.5-turbo myeval`; last output: `[2023-06-28 18:09:56,280] [registry.py:266] Loading registry from /Users/username/development/evals/evals/registry/evals...`

- **bug**: Running an eval that uses the `evals.elsuite.basic.includes:Includes` class via the latest pip package (1.0.3.post1) as a standalone app gives misleading results. To reproduce: 1. Install...

- **feature request**: I wonder if anyone has a solid method for evaluating code benchmarks like APPS. String-typed code can be very noisy and...

- GPT-4 seems to neglect the instructions in the closedqa prompt much more often than gpt-3.5-turbo. See, for example, https://github.com/openai/evals/issues/1200#issuecomment-1605238900, where gpt-4 gives 9 invalid responses out of 47,...

- **bug**: Running oaieval fails with a `UnicodeDecodeError` at line 207 of registry.py; adding the encoding solves the problem (see code snippets). To reproduce: 1) Run the...
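A minimal sketch of the fix that issue describes: passing an explicit `encoding` to `open()` avoids the locale-dependent default. The file name below is hypothetical, and the exact call site in registry.py is an assumption, not shown in the preview.

```python
from pathlib import Path

# Hypothetical stand-in for a registry YAML file containing non-ASCII text.
path = Path("registry_example.yaml")
path.write_text("name: caf\u00e9\n", encoding="utf-8")

# Before the fix: open(path) uses the platform's default encoding, which on
# some systems (e.g. cp1252 on Windows) raises UnicodeDecodeError here.
# After the fix: state the encoding explicitly.
with open(path, encoding="utf-8") as f:
    content = f.read()

print(content.strip())  # name: café
```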

- **feature request**: The documentation in eval-templates.md describes `basic/match.py` as `Match: any([b.startswith(a) for b in B])` "[f]or a model completion `a` and a reference list...
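The one-liner quoted from eval-templates.md can be reproduced as a standalone sketch to see its behavior: a completion `a` passes whenever any reference in `B` merely starts with it, so a short completion can match a longer reference.

```python
def match(a: str, B: list[str]) -> bool:
    # The Match check as documented: pass if any reference string
    # starts with the model completion.
    return any(b.startswith(a) for b in B)

print(match("Par", ["Paris", "London"]))     # True: "Paris" starts with "Par"
print(match("Paris", ["Paris"]))             # True: exact match
print(match("Berlin", ["Paris", "London"]))  # False: no reference starts with it
```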

- **feature request**: First off, let me just say how great functions are! Game changer! It would be really cool if the functions would...

- **feature request**: An idea for an eval that won't suffer from data contamination or overfitting: maybe use financial news before the opening bell for...

- **feature request**: It should be possible to create a benchmark that evaluates the model's ability to extract key information from abstracts...

- **feature request**: A version flag, like this: `oaieval --version` → `1.0.3.post1`. Additional context: _No response_
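As a sketch of what the requested flag could look like: the preview does not show which CLI framework oaieval uses, so argparse and the hard-coded version string below are assumptions (the real value would likely come from package metadata, e.g. `importlib.metadata.version("evals")`).

```python
import argparse

# Hypothetical version string for illustration only.
VERSION = "1.0.3.post1"

parser = argparse.ArgumentParser(prog="oaieval")
# argparse's built-in "version" action prints the string and exits.
parser.add_argument("--version", action="version", version=VERSION)
```

Running `oaieval --version` would then print `1.0.3.post1` and exit with status 0.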