
Function Name Mismatch in Prompt vs. Test Code – HumanEval/161

Open talhakabakus opened this issue 9 months ago • 1 comment

I’ve discovered a bug in the HumanEval dataset, specifically in Task ID: HumanEval/161, which causes incorrect evaluation of LLM-generated code.

In the prompt for this task, the function name is defined as solve, as shown below:

def solve(s):
    """You are given a string s.
    if s[i] is a letter, reverse its case from lower to upper or vise versa, 
    otherwise keep it as it is.
    If the string contains no letters, reverse the string.
    The function should return the resulted string.
    Examples
    solve("1234") = "4321"
    solve("ab") = "AB"
    solve("#a@C") = "#A@c"
    """
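For context, a logically correct completion of this prompt is short. The sketch below is an illustrative solution (not taken from any model output) that satisfies the docstring examples:

```python
def solve(s):
    # If the string contains at least one letter, flip the case of
    # every letter and leave other characters unchanged.
    if any(c.isalpha() for c in s):
        return s.swapcase()
    # Otherwise (no letters at all), reverse the whole string.
    return s[::-1]

print(solve("1234"))  # -> "4321"
print(solve("ab"))    # -> "AB"
print(solve("#a@C"))  # -> "#A@c"
```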

However, the test code for the same task calls a function named candidate:

def check(candidate):
    # Check some simple cases
    assert candidate("AsDf") == "aSdF"
    assert candidate("1234") == "4321"
    assert candidate("ab") == "AB"
    assert candidate("#a@C") == "#A@c"
    assert candidate("#AsdfW^45") == "#aSDFw^45"
    assert candidate("#6@2") == "2@6#"

    # Check some edge cases that are easy to work out by hand.
    assert candidate("#$a^D") == "#$A^d"
    assert candidate("#ccc") == "#CCC"

    # Don't remove this line:
    ...

Because LLMs typically generate a function named solve (as instructed by the prompt), this discrepancy leads to false negatives in test results. The generated function may be logically correct, but it fails the tests due to the incorrect function name.

Why this matters: This inconsistency causes misleading evaluation outcomes and can unfairly penalize model performance.

Suggested Resolution: Align the function name used in the prompt with the one called by the test code, for example by using a single consistent name such as solve in both places.

Additional Issue: Test Code is Not Executed Properly

Beyond the function name mismatch, I also noticed that the check() function in the test code is defined but never called. This causes an even more serious issue during evaluation: the assert statements inside check() are never executed.

talhakabakus avatar May 27 '25 13:05 talhakabakus

This issue appears to be based on an AI hallucination and should be closed.

If you look at the test harness, you can see that the test program is built by appending a call to check, with the entry_point (solve in this case) passed as the candidate argument:

https://github.com/openai/human-eval/blob/6d43fb980f9fee3c892a914eda09951f772ad10d/human_eval/execution.py#L33

Or in other words, the following is appended to the test program:

check(solve)
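The mechanism above can be sketched as follows. This is a simplified, illustrative reconstruction of what execution.py does (the real harness additionally sandboxes and times out the program); the string contents here are stand-ins, not the actual dataset entries:

```python
# Simplified sketch of how the HumanEval harness assembles a test program.
# `solution` stands for prompt + model completion; `test` is the task's
# check function; `entry_point` comes from the dataset ("solve" for
# HumanEval/161). `candidate` is merely check's parameter name.
solution = (
    "def solve(s):\n"
    "    if any(c.isalpha() for c in s):\n"
    "        return s.swapcase()\n"
    "    return s[::-1]\n"
)
test = (
    "def check(candidate):\n"
    '    assert candidate("1234") == "4321"\n'
    '    assert candidate("ab") == "AB"\n'
)
entry_point = "solve"

# The call `check(solve)` is appended, so the asserts do run.
program = solution + test + f"check({entry_point})\n"
exec(program)  # raises AssertionError only if the solution is wrong
```

Because check receives the generated function as its argument, a prompt function named solve and a test harness that calls candidate are perfectly compatible: candidate is just the local name bound to solve at call time.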

99991 avatar Aug 25 '25 08:08 99991