lm-evaluation-harness
Add HumanEval
Hi, I added the widely-used HumanEval benchmark. This partially resolves #1157.
The implementation relies on `pass@k` from the HF `evaluate` module, so it requires the environment variable `HF_ALLOW_CODE_EVAL=1`. To support this task, I also made two minimal changes to lm-eval:
- HumanEval needs to concatenate the prompt and the completion to build the full output code. I added a `custom` filter that allows arbitrary Python filter functions (a short sketch of how this feeds the metric follows this list).
- To estimate pass@k, multiple model-generated strings should be passed to the metric function. I fixed the type casting of `gold` in `ConfigurableTask.process_results`.
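For context, here is a minimal sketch (my own illustration, not code from this PR) of the `code_eval` metric from the HF `evaluate` library that backs `pass@k`; the prompt, completion, and test strings are made up. It shows why `HF_ALLOW_CODE_EVAL=1` is required (the metric executes generated code) and why the candidate handed to the metric must be the concatenation of prompt and completion.

```python
import os

# code_eval executes (untrusted) generated code, so it refuses to run
# unless this environment variable is set.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

import evaluate

code_eval = evaluate.load("code_eval")

# Made-up, HumanEval-style example (not real HumanEval data).
prompt = "def add(a, b):\n"        # problem prompt: function signature
completion = "    return a + b\n"  # model-generated continuation
test = "assert add(1, 2) == 3"     # unit test for the problem

# The executed candidate must be a complete program, i.e. prompt + completion;
# reconstructing this concatenation is what the custom filter is for.
pass_at_k, per_sample_results = code_eval.compute(
    references=[test],                    # one test string per problem
    predictions=[[prompt + completion]],  # list of candidate programs per problem
    k=[1],
)
print(pass_at_k)  # e.g. {'pass@1': 1.0}
```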
Here are some evaluations I ran as a sanity check. Due to limited resources, I used greedy generation (`humaneval_greedy`). The versions used were `torch==2.3.1` and `transformers==4.41.2`.
| Models | reference (see below) | lm-eval (bsz=1) | lm-eval (bsz=32) |
|---|---|---|---|
| Meta-Llama-3-8B | 0.3780 | 0.3780 | 0.3720 |
| gemma-7b | 0.3232 | 0.3232 | 0.3110 |
| Qwen2-7B | 0.4756 | 0.4756 | 0.5061 |
| Mistral-7B-v0.3 | 0.2744 | 0.0122 | 0.0122 |
I found that greedy generation scores can vary with batch sizes, so I reported results for bsz=1 and bsz=32.
I also found that the poor performance of Mistral is due to its tokenizer. It changes the number of spaces when splitting continuation tokens from context tokens. For example:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

text = "\n def foo(x):"
num_context_tokens = len(tokenizer.encode("\n", add_special_tokens=False))

print(text[1:])
# ' def foo(x):'
print(tokenizer.decode(tokenizer.encode(text, add_special_tokens=False)[num_context_tokens:]))
# ' def foo(x):'
```
However, I didn't attempt to fix it in this PR because the fix seems to belong here, which may have a broader impact. Refer to the reference evaluation below for possible fixes.
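To make the suggested workaround concrete, here is a minimal sketch (my own illustration, not code from the PR) contrasting the two ways of recovering the continuation string on the toy example above; the exact whitespace behaviour depends on the tokenizer and `transformers` version.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

context = "\n"
text = "\n def foo(x):"

ids = tokenizer.encode(text, add_special_tokens=False)
num_context_tokens = len(tokenizer.encode(context, add_special_tokens=False))

# Token-slicing: decode only the continuation token ids. With the Mistral
# tokenizer this can alter the number of leading spaces.
print(repr(tokenizer.decode(ids[num_context_tokens:])))

# String-slicing (what the reference script below does): decode the full
# sequence, then cut the context off as a string.
print(repr(tokenizer.decode(ids)[len(context):]))
```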
Reference evaluation details
Based on the official repo (https://github.com/openai/human-eval), I ran simple model generation with the following script:

```python
import os
import argparse

from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import write_jsonl, read_problems

STOP_STRINGS = [
    "\nclass",
    "\ndef",
    "\n#",
    "\nif",
    "\nprint",
]


def generate_one_completion(prompt, model, tokenizer):
    """Generate one completion for a given prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()

    # Generate completion
    output = model.generate(
        input_ids=input_ids,
        tokenizer=tokenizer,
        do_sample=False,
        stop_strings=STOP_STRINGS,
        max_new_tokens=1024,
    )
    completion = tokenizer.decode(output[0], skip_special_tokens=True)
    completion = completion[len(prompt):]
    for stop_string in STOP_STRINGS:
        completion = completion.split(stop_string)[0]
    return completion


def main(args):
    model = AutoModelForCausalLM.from_pretrained(args.model, torch_dtype="auto").cuda()
    tokenizer = AutoTokenizer.from_pretrained(args.model)

    problems = read_problems()
    num_samples_per_task = 1
    samples = [
        dict(
            task_id=task_id,
            completion=generate_one_completion(
                problems[task_id]["prompt"],
                model,
                tokenizer,
            ),
        )
        for task_id in problems
        for _ in range(num_samples_per_task)
    ]

    model_name = os.path.basename(args.model)
    write_jsonl(f"{model_name}.jsonl", samples)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, required=True)
    args = parser.parse_args()
    main(args)
```
Then, I evaluated it as follows:
```
evaluate_functional_correctness $MODEL_NAME.jsonl
```
Thanks for adding this (super helpful)! I'm currently running humaneval using your PR.
Any ideas why greedy scores change when using different batch sizes? Seems odd to me, and I'm wondering if it indicates a bug.
It seems to be a more general and known issue (see https://github.com/huggingface/transformers/issues/26869 or https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535), but I'm not certain.
What is the status of this PR? Is there a reason why it hasn't landed?
I just pulled this PR and rebased it against the current main (26f607f5432e1d09c55b25488c43523e7ecde657), and I ran into no issues.
I tested this with `humaneval_greedy` on the model meta-llama/Llama-3.1-8B-Instruct; the results are:
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------|------:|------|-----:|---------|---|-----:|---|-----:|
|humaneval_greedy| 1|n=1 | 0|pass_at_1|↑ |0.6341|± |0.0377|
I think the code is correct. I will run this on more models if you want further testing.
Hi @RawthiL,
I am currently testing HumanEval using GPT-4o-mini as a proof of concept. However, I encountered an issue where the following command results in a 0 value for the pass_at_1 metric. Could you kindly guide me on why this might be happening? I would greatly appreciate your insights. Thanks.
```
lm_eval \
    --model openai-chat-completions \
    --model_args model=gpt-4o-mini,num_concurrent=5 \
    --tasks humaneval_greedy \
    --apply_chat_template
```
Output:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| humaneval_greedy | 1 | n=1 | 0 | pass_at_1 | ↑ | 0 | ± | 0 |
Hello! Any progress?
Hi! Sorry this took so long! I just added some confirmation boilerplate to ensure we handle unsafe code safely. Thanks for bearing with me.