Cannot Run EvalPlus on a Single HumanEval Problem
Hello,
I'm trying to find a way to run an evaluation on a single HumanEval problem. My goal is to have a much faster feedback loop for testing individual solutions without running the entire dataset, which can take several minutes.
I have tried two main approaches, but both have failed.
Attempt 1: Using a JSONL File
I created a single_problem.jsonl file with one entry for HumanEval/0.
{"task_id": "HumanEval/0", "solution": "from typing import List\n\ndef has_close_elements(...)..."}
When I ran the command:
evalplus.evaluate --dataset humaneval --samples single_problem.jsonl
I received this error, which seems to indicate the tool expects all 164 problems to be present:
Traceback (most recent call last):
File ".../evalplus/evaluate.py", line 243, in evaluate
assert len(completion_id) == len(problems), "Missing problems in samples"
AssertionError: Missing problems in samples
Attempt 2: Using the Directory Format
I followed the directory structure format, creating:
my_single_sample/
└── HumanEval_0/
    └── 0.py
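(For reference, that layout can be reproduced with a few lines of Python; the solution string below is just a placeholder, not a working implementation:)

import os

# Recreate the directory-format layout shown above; the solution string
# stands in for an actual completion of HumanEval/0.
solution = (
    "from typing import List\n\n"
    "def has_close_elements(numbers: List[float], threshold: float) -> bool:\n"
    "    ...\n"
)

os.makedirs("my_single_sample/HumanEval_0", exist_ok=True)
with open("my_single_sample/HumanEval_0/0.py", "w", encoding="utf-8") as f:
    f.write(solution)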
When I run this command:
python -m evalplus.evaluate --dataset humaneval --samples my_single_sample
I get the same AssertionError as with the JSONL method, suggesting the tool is checking the sample count against the problem count regardless of the format.
Traceback (most recent call last):
File ".../evalplus/evaluate.py", line 243, in evaluate
assert len(completion_id) == len(problems), "Missing problems in samples"
AssertionError: Missing problems in samples
My Question:
Is there a supported way to run evalplus.evaluate on a single sample, or is the tool designed to run exclusively on the complete dataset?
Any guidance on how to achieve a fast, single-problem evaluation would be greatly appreciated.
Thanks!
I had the same question, and I’ve found a way to solve it.
The key to my solution lies in the Markdown file provided by the authors: evalplus/docs/cli.md. In the Code Evaluation section, the third line shows how to run evaluation with a self-defined dataset locally.
Here's what I did:
- I used the code in evalplus/evalplus/data/humaneval.py to download the HumanEval+ dataset; this works well. I had tried to download it directly, but I couldn't find a ready-to-use .jsonl or .gz file on Hugging Face, so downloading it programmatically was necessary.
- I added a simple entry point at the end of that file to export the dataset:
if __name__ == "__main__":
    import json  # in case json is not already imported at the top of humaneval.py

    resp = get_human_eval_plus()
    output_path = "/home/humanevalplus.jsonl"
    with open(output_path, 'w', encoding='utf-8') as outfile:
        for key, value in resp.items():
            try:
                json.dump(value, outfile, ensure_ascii=False)
                outfile.write('\n')  # JSONL format: one object per line
                print(f"Added case: {key}")
            except Exception as e:
                print(f"Error processing {key}: {e}")
This script downloads the full HumanEval+ dataset and saves it as a .jsonl file. You can then remove the cases you don't need. For testing, I kept only three cases: HumanEval/0, HumanEval/1, and HumanEval/2.
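If you'd rather not trim the file by hand, a small filter script works just as well. Here is a minimal sketch; the subset file name and the kept task IDs are only examples:

import json

# Task IDs to keep; every other problem is dropped from the subset file.
KEEP = {"HumanEval/0", "HumanEval/1", "HumanEval/2"}

with open("/home/humanevalplus.jsonl", encoding="utf-8") as src, \
        open("/home/humanevalplus_subset.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        if json.loads(line)["task_id"] in KEEP:
            dst.write(line)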
Before running the evaluation, remember to compress the JSONL file into a .gz archive. On Linux, you can use:
gzip -k your_jsonl_file.jsonl
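If you prefer to stay in Python, the standard library's gzip module does the same job (adjust the file names to wherever you exported the dataset):

import gzip
import shutil

# Write a .gz copy next to the original, equivalent to `gzip -k`.
with open("humanevalplus.jsonl", "rb") as src, gzip.open("humanevalplus.jsonl.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)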
Finally, I ran the evaluation using:
HUMANEVAL_OVERRIDE_PATH="/path/to/humanevalplus.jsonl.gz" evalplus.evaluate --dataset humaneval --samples samples.jsonl
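Note that samples.jsonl still has to cover the task IDs you kept in the override file, since the length assertion from the tracebacks above is now checked against the reduced problem set. If you want to drive this from a script for a quick feedback loop, the call can be wrapped like this (just a sketch, reusing the placeholder paths from the command above):

import os
import subprocess

# Point evalplus at the reduced dataset and evaluate the matching samples.
env = dict(os.environ, HUMANEVAL_OVERRIDE_PATH="/path/to/humanevalplus.jsonl.gz")
subprocess.run(
    ["evalplus.evaluate", "--dataset", "humaneval", "--samples", "samples.jsonl"],
    env=env,
    check=True,
)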
The evaluation worked correctly, and the results looked reasonable: