
Results: 57 simple-evals issues (sorted by recently updated)

Update SimpleQAEval to use direct CSV URL instead of blobfile Changes made: - Replace Azure blob storage path (az://) with direct HTTPS URL for accessing the SimpleQA dataset - Remove...

Hi, I'm encountering an "Access Failure" error (ResourceNotFound) when trying to read the simple_qa_test_set.csv file from the Azure path az://openaipublic/simple-evals/simple_qa_test_set.csv using pandas and blobfile. Is the simple-evals container publicly accessible?...
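The HealthBench file quoted in a later issue is served from `https://openaipublic.blob.core.windows.net/simple-evals/...`, which suggests the `az://openaipublic/...` paths map onto that public HTTPS endpoint. A minimal sketch of that mapping, so the CSV can be fetched without `blobfile` (the `az_to_https` helper is hypothetical, and the assumption is that the container is publicly readable over HTTPS):

```python
def az_to_https(az_path: str) -> str:
    """Convert an az://<account>/<container>/<blob> path to the public
    Azure Blob Storage HTTPS endpoint (hypothetical helper; assumes the
    account allows anonymous reads)."""
    if not az_path.startswith("az://"):
        raise ValueError(f"not an az:// path: {az_path!r}")
    account, _, rest = az_path[len("az://"):].partition("/")
    return f"https://{account}.blob.core.windows.net/{rest}"

# e.g. pandas can then read the URL directly:
#   pd.read_csv(az_to_https("az://openaipublic/simple-evals/simple_qa_test_set.csv"))
```

If the container is private, the HTTPS URL will return the same ResourceNotFound-style error, in which case the fix has to come from the maintainers' side.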

## File In the data file: "https://openaipublic.blob.core.windows.net/simple-evals/healthbench/2025-05-07-06-14-12_oss_eval.jsonl" ## prompt_id: 636e76dc-dd03-442a-a3bc-96c720ea57aa ## content of the prompt: > "Thick milfs really turn me on. I love when they're around 5’5 to 5’8,...

I told Jules to write a beginner friendly README.md file.

Hi there, I just found that the answer to "An article was written by an author about the work of another author in 2011. The work analyzed was satirical in...

Looking at the code below, it looks like it can be evaluated using the "math_test" or "math_500_test" dataset. https://github.com/openai/simple-evals/blob/ee3b0318d8d1d9d72755a4120879be65f7c07e9e/math_eval.py#L32 What dataset was used to obtain the scores listed in Benchmark...

Hi, I see that simple-evals won't be updated. That's understandable, but what should we do if we need to run evaluations on new tasks or datasets? Is there a recommended...

We cannot reproduce the reported 0.67 score for GPT-5 using your healthbench_eval and don't know why. Can you help?

I noticed that the HealthBench paper mentions: `Evaluating LLMs on clinical tasks is challenging and expensive. HealthBench, while leveraging GPT-4.1 as a judge to reduce human effort, still incurs API...

## Problem The `grade_sample` method in `browsecomp_eval.py` has a regex bug that prevents correct grading evaluation. **Current code:** ```python match = re.search(r"correct: (yes|no)", grading_response) return match.group(0) if match else "no"...
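The excerpt above cuts off, but the likely bug is visible: `match.group(0)` returns the entire match (e.g. `"correct: yes"`), so any later comparison against the bare token `"yes"` never succeeds. A plausible fix, under the assumption that downstream code compares the return value to `"yes"`/`"no"` (the `extract_grade` name is illustrative, not the repo's):

```python
import re

def extract_grade(grading_response: str) -> str:
    # group(0) is the full match "correct: yes"; group(1) is just the
    # captured "yes"/"no" token, which is what the caller compares against.
    match = re.search(r"correct: (yes|no)", grading_response)
    return match.group(1) if match else "no"
```

With this change, `extract_grade("... correct: yes")` yields `"yes"` instead of `"correct: yes"`, so correct answers are no longer graded as incorrect.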