
Cannot Run Evalplus on Single Humaneval Problems

Open · jasonzliang opened this issue 1 month ago • 1 comment

Hello,

I'm trying to find a way to run an evaluation on a single HumanEval problem. My goal is to have a much faster feedback loop for testing individual solutions without running the entire dataset, which can take several minutes.

I have tried two main approaches, but both have failed.

Attempt 1: Using a JSONL File

I created a single_problem.jsonl file with one entry for HumanEval/0.

{"task_id": "HumanEval/0", "solution": "from typing import List\n\ndef has_close_elements(...)..."}

When I ran the command:

evalplus.evaluate --dataset humaneval --samples single_problem.jsonl

I received this error, which seems to indicate the tool expects all 164 problems to be present:

Traceback (most recent call last):
  File ".../evalplus/evaluate.py", line 243, in evaluate
    assert len(completion_id) == len(problems), "Missing problems in samples"
AssertionError: Missing problems in samples

Attempt 2: Using the Directory Format

I followed the directory structure format, creating:

my_single_sample/
└── HumanEval_0/
    └── 0.py

When I ran this command:

python -m evalplus.evaluate --dataset humaneval --samples my_single_sample

I got the same AssertionError as with the JSONL method, which suggests the tool checks the sample count against the problem count regardless of the input format:

Traceback (most recent call last):
  File ".../evalplus/evaluate.py", line 243, in evaluate
    assert len(completion_id) == len(problems), "Missing problems in samples"
AssertionError: Missing problems in samples

My Question:

Is there a supported way to run evalplus.evaluate on a single sample? Or is the tool exclusively designed to run on the complete dataset?

Any guidance on how to achieve a fast, single-problem evaluation would be greatly appreciated.

Thanks!

jasonzliang commented on Oct 29, 2025

I had the same question, and I’ve found a way to solve it.

The key to my solution is the Markdown file provided by the authors, evalplus/docs/cli.md. In its Code Evaluation section, the third command shows how to run the evaluation locally on a self-defined dataset.

Here's what I did:

  • I used the code in evalplus/evalplus/data/humaneval.py to download the HumanEval+ dataset, which works well. I could not find a ready-to-use .jsonl or .gz file on Hugging Face, so downloading it programmatically was necessary.
  • I added a simple entry point at the end of the file to export the dataset:
# Appended to the end of evalplus/evalplus/data/humaneval.py
import json  # may already be imported at the top of the module

if __name__ == "__main__":
    # get_human_eval_plus() is defined in this file; it downloads (and caches)
    # the HumanEval+ dataset as a dict mapping task_id -> problem dict.
    resp = get_human_eval_plus()
    output_path = "/home/humanevalplus.jsonl"
    with open(output_path, 'w', encoding='utf-8') as outfile:
        for key, value in resp.items():
            try:
                json.dump(value, outfile, ensure_ascii=False)
                outfile.write('\n')  # JSONL format: one object per line
                print(f"Added case: {key}")
            except Exception as e:
                print(f"Error processing {key}: {e}")

This script downloads the full HumanEval+ dataset and saves it as a .jsonl file. You can then remove the cases you don't need. For testing, I kept only three cases: HumanEval/0, HumanEval/1, and HumanEval/2.
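
If you'd rather not trim the file by hand, a minimal filter sketch along these lines works; the file names and the keep_ids set are just my examples, not anything provided by evalplus:

import json

# Hypothetical file names and task IDs; adjust to your setup.
keep_ids = {"HumanEval/0", "HumanEval/1", "HumanEval/2"}

with open("humanevalplus.jsonl", encoding="utf-8") as src, \
        open("humanevalplus_subset.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        problem = json.loads(line)
        if problem["task_id"] in keep_ids:
            dst.write(line)  # copy matching lines through unchanged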

Before running the evaluation, remember to compress the JSONL file into a .gz archive. On Linux, you can use:

gzip -k your_jsonl_file.jsonl
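
If you prefer to stay in Python, the standard library's gzip module does the same thing (again, the file names are just my example names from the sketch above):

import gzip
import shutil

# Compress the filtered JSONL into the .gz archive passed via the override path.
with open("humanevalplus_subset.jsonl", "rb") as src, \
        gzip.open("humanevalplus_subset.jsonl.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)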

Finally, I ran the evaluation using:

HUMANEVAL_OVERRIDE_PATH="/path/to/humanevalplus.jsonl.gz" evalplus.evaluate --dataset humaneval --samples samples.jsonl
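
One caveat: the assertion in evaluate.py still applies to the overridden dataset, so samples.jsonl has to contain at least one solution for every task_id left in the trimmed file (in my case HumanEval/0, HumanEval/1, and HumanEval/2).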

The evaluation worked correctly, and the results looked reasonable:

[screenshot of the evaluation results]

RainingSea commented on Nov 3, 2025