
Results: 14 simple-evals issues

Fix for #2. Simply renames `types.py` to `eval_types.py`.

Following up on #1: loading directly through urllib instead of Blobfile. Azure still tries to invoke credentials, even for a public file, and doesn't seem to work (for me...
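A minimal sketch of the urllib-based loading described above, assuming the dataset is a public CSV file; the helper name and signature here are illustrative, not the PR's actual code:

```python
import csv
import io
import urllib.request

def load_csv_rows(url: str) -> list[dict]:
    # Plain HTTP(S) GET via urllib: no cloud SDK involved, so no
    # Azure credential lookup is triggered for a public file.
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    # Parse the body as CSV, one dict per row keyed by the header.
    return list(csv.DictReader(io.StringIO(text)))
```

The design point is that `blobfile` routes even anonymous reads through its cloud-credential machinery, while `urllib` treats the URL as an ordinary web resource.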

The `types.py` file shadows the stdlib module `types`.

Please add them as well.

```[tasklist]
### Tasks
- [ ] Run benchmarks also for GPT-3.5 versions and Claude Sonnet and Haiku #7
```

There is a small typo in `humaneval_eval.py` where a non-existent method named `_pack_mesage` is called. This PR uses the correct function name.

I'm a student working on a final project and wanted to use the granular data here (e.g. not "GPT-4o hits 88.7% on MMLU" but rather "what did it answer for...

This PR fixes a typo in `humaneval_eval.py`.

Before the fix:

```
sampler._pack_mesage(role="user", content=instruction + sample["prompt"])
```

After the fix:

```
sampler._pack_message(role="user", content=instruction + sample["prompt"])
```

Zero-shot scores for those models are not easily found online, so this would be very useful for tracking the improvement trend over time!

It would be useful to have access to tables with scores for individual evaluation items, as argued here: https://www.science.org/doi/pdf/10.1126/science.adf6369