
Results: 57 simple-evals issues (sorted by recently updated)

Update SimpleQAEval to use direct CSV URL instead of blobfile Changes made: - Replace Azure blob storage path (az://) with direct HTTPS URL for accessing the SimpleQA dataset - Remove...

Hi, I'm encountering an "Access Failure" error (ResourceNotFound) when trying to read the simple_qa_test_set.csv file from the Azure path az://openaipublic/simple-evals/simple_qa_test_set.csv using pandas and blobfile. Is the simple-evals container publicly accessible?...
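The HealthBench file quoted in a later issue is served from `https://openaipublic.blob.core.windows.net/simple-evals/...`, which suggests the `az://openaipublic/...` paths map onto that public HTTPS endpoint. A minimal sketch of that mapping, so the CSV can be fetched without `blobfile` (the `az_to_https` helper is hypothetical, and the assumption is that the container is publicly readable over HTTPS):

```python
def az_to_https(az_path: str) -> str:
    """Convert an az://<account>/<container>/<blob> path to the public
    Azure Blob Storage HTTPS endpoint (hypothetical helper; assumes the
    account allows anonymous reads)."""
    if not az_path.startswith("az://"):
        raise ValueError(f"not an az:// path: {az_path!r}")
    account, _, rest = az_path[len("az://"):].partition("/")
    return f"https://{account}.blob.core.windows.net/{rest}"

# e.g. pandas can then read the URL directly:
#   pd.read_csv(az_to_https("az://openaipublic/simple-evals/simple_qa_test_set.csv"))
```

If the container is private, the HTTPS URL will return the same ResourceNotFound-style error, in which case the fix has to come from the maintainers' side.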

## File In the data file: "https://openaipublic.blob.core.windows.net/simple-evals/healthbench/2025-05-07-06-14-12_oss_eval.jsonl" ## prompt_id: 636e76dc-dd03-442a-a3bc-96c720ea57aa ## content of the prompt: > "Thick milfs really turn me on. I love when they're around 5’5 to 5’8,...

I told Jules to write a beginner friendly README.md file.

Hi there, I just found that the answer to "An article was written by an author about the work of another author in 2011. The work analyzed was satirical in...

Looking at the code below, it looks like it can be evaluated using the "math_test" or "math_500_test" dataset. https://github.com/openai/simple-evals/blob/ee3b0318d8d1d9d72755a4120879be65f7c07e9e/math_eval.py#L32 What dataset was used to obtain the scores listed in Benchmark...

Hi, I see that simple-evals won't be updated. That's understandable, but what should we do if we need to run evaluations on new tasks or datasets? Is there a recommended...

We cannot reproduce the reported 0.67 score for GPT-5 using your healthbench_eval and don't know why. Can you help?

I noticed that the HealthBench paper mentions: `Evaluating LLMs on clinical tasks is challenging and expensive. HealthBench, while leveraging GPT-4.1 as a judge to reduce human effort, still incurs API...

## Problem The `grade_sample` method in `browsecomp_eval.py` has a regex bug that prevents correct grading evaluation. **Current code:** ```python match = re.search(r"correct: (yes|no)", grading_response) return match.group(0) if match else "no"...
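The excerpt above cuts off, but the likely bug is visible: `match.group(0)` returns the entire match (e.g. `"correct: yes"`), so any later comparison against the bare token `"yes"` never succeeds. A plausible fix, under the assumption that downstream code compares the return value to `"yes"`/`"no"` (the `extract_grade` name is illustrative, not the repo's):

```python
import re

def extract_grade(grading_response: str) -> str:
    # group(0) is the full match "correct: yes"; group(1) is just the
    # captured "yes"/"no" token, which is what the caller compares against.
    match = re.search(r"correct: (yes|no)", grading_response)
    return match.group(1) if match else "no"
```

With this change, `extract_grade("... correct: yes")` yields `"yes"` instead of `"correct: yes"`, so correct answers are no longer graded as incorrect.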