
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Results: 428 evals issues, sorted by recently updated.

- **bug**: `oaieval` frequently hangs near the end, before reporting. To reproduce: run `EVALS_THREADS=12 EVALS_THREAD_TIMEOUT=10 oaieval gpt-3.5-turbo myeval`; last output: `[2023-06-28 18:09:56,280] [registry.py:266] Loading registry from /Users/username/development/evals/evals/registry/evals...`

- **bug**: Running an eval that uses the `evals.elsuite.basic.includes:Includes` class via the latest pip package (1.0.3.post1) as a standalone app gives misleading results. To reproduce: 1. Install...

- **feature request**: I wonder if anyone has a solid method for evaluating code benchmarks like APPS. String-typed code can be very noisy and...

- GPT-4 seems to neglect the instructions in the closedqa prompt much more often than gpt-3.5-turbo. See, for example, https://github.com/openai/evals/issues/1200#issuecomment-1605238900, where gpt-4 gives 9 invalid responses out of 47,...

- **bug**: Running oaieval fails with a `UnicodeDecodeError` at line 207 of registry.py; adding the encoding solves the problem (see code snippets). To reproduce: 1) Run the...
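A minimal sketch of the fix that issue describes: passing an explicit `encoding` to `open()` avoids the locale-dependent default. The file name below is hypothetical, and the exact call site in registry.py is an assumption, not shown in the preview.

```python
from pathlib import Path

# Hypothetical stand-in for a registry YAML file containing non-ASCII text.
path = Path("registry_example.yaml")
path.write_text("name: caf\u00e9\n", encoding="utf-8")

# Before the fix: open(path) uses the platform's default encoding, which on
# some systems (e.g. cp1252 on Windows) raises UnicodeDecodeError here.
# After the fix: state the encoding explicitly.
with open(path, encoding="utf-8") as f:
    content = f.read()

print(content.strip())  # name: café
```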

- **feature request**: The documentation in eval-templates.md describes `basic/match.py` as `Match: any([b.startswith(a) for b in B])` "[f]or a model completion `a` and a reference list...
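The one-liner quoted from eval-templates.md can be reproduced as a standalone sketch to see its behavior: a completion `a` passes whenever any reference in `B` merely starts with it, so a short completion can match a longer reference.

```python
def match(a: str, B: list[str]) -> bool:
    # The Match check as documented: pass if any reference string
    # starts with the model completion.
    return any(b.startswith(a) for b in B)

print(match("Par", ["Paris", "London"]))     # True: "Paris" starts with "Par"
print(match("Paris", ["Paris"]))             # True: exact match
print(match("Berlin", ["Paris", "London"]))  # False: no reference starts with it
```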

- **feature request**: First off, let me just say how great functions are! Game changer! It would be really cool if the functions would...

- **feature request**: An idea for an eval that won't suffer from data contamination or overfitting: maybe use financial news before the opening bell for...

- **feature request**: It should be possible to create a benchmark that evaluates the model's ability to extract key information from abstracts...

- **feature request**: A version flag, like this: `oaieval --version` → `1.0.3.post1`. Additional context: _No response_
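As a sketch of what the requested flag could look like: the preview does not show which CLI framework oaieval uses, so argparse and the hard-coded version string below are assumptions (the real value would likely come from package metadata, e.g. `importlib.metadata.version("evals")`).

```python
import argparse

# Hypothetical version string for illustration only.
VERSION = "1.0.3.post1"

parser = argparse.ArgumentParser(prog="oaieval")
# argparse's built-in "version" action prints the string and exits.
parser.add_argument("--version", action="version", version=VERSION)
```

Running `oaieval --version` would then print `1.0.3.post1` and exit with status 0.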