lm-evaluation-harness
Add HumanEval
Hi, I added the widely-used HumanEval benchmark. This partially resolves #1157.
The implementation relies on `pass@k` from the HF `evaluate` module, so it requires the environment variable `HF_ALLOW_CODE_EVAL=1`. To support this task, I also made two minimal changes to lm-eval:
- HumanEval needs to concatenate the prompt and the completion to build the full output code. I added a `custom` filter that allows arbitrary Python filter functions (a short sketch of how this feeds the metric follows this list).
- To estimate pass@k, multiple model-generated strings should be passed to the metric function. I fixed the type casting of `gold` in `ConfigurableTask.process_results`.
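For context, here is a minimal sketch (my own illustration, not code from this PR) of the `code_eval` metric from the HF `evaluate` library that backs `pass@k`; the prompt, completion, and test strings are made up. It shows why `HF_ALLOW_CODE_EVAL=1` is required (the metric executes generated code) and why the candidate handed to the metric must be the concatenation of prompt and completion.

```python
import os

# code_eval executes (untrusted) generated code, so it refuses to run
# unless this environment variable is set.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

import evaluate

code_eval = evaluate.load("code_eval")

# Made-up, HumanEval-style example (not real HumanEval data).
prompt = "def add(a, b):\n"        # problem prompt: function signature
completion = "    return a + b\n"  # model-generated continuation
test = "assert add(1, 2) == 3"     # unit test for the problem

# The executed candidate must be a complete program, i.e. prompt + completion;
# reconstructing this concatenation is what the custom filter is for.
pass_at_k, per_sample_results = code_eval.compute(
    references=[test],                    # one test string per problem
    predictions=[[prompt + completion]],  # list of candidate programs per problem
    k=[1],
)
print(pass_at_k)  # e.g. {'pass@1': 1.0}
```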
Here are some evaluations I ran as a sanity check. Due to limited resources, I used greedy generation (`humaneval_greedy`). The versions used were `torch==2.3.1` and `transformers==4.41.2`.
| Models | reference (see below) | lm-eval (bsz=1) | lm-eval (bsz=32) |
|---|---|---|---|
| Meta-Llama-3-8B | 0.3780 | 0.3780 | 0.3720 |
| gemma-7b | 0.3232 | 0.3232 | 0.3110 |
| Qwen2-7B | 0.4756 | 0.4756 | 0.5061 |
| Mistral-7B-v0.3 | 0.2744 | 0.0122 | 0.0122 |
I found that greedy generation scores can vary with batch sizes, so I reported results for bsz=1 and bsz=32.
I also found that the poor performance of Mistral is due to its tokenizer. It changes the number of spaces when splitting continuation tokens from context tokens. For example:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

text = "\n def foo(x):"
num_context_tokens = len(tokenizer.encode("\n", add_special_tokens=False))

print(text[1:])
# ' def foo(x):'
print(tokenizer.decode(tokenizer.encode(text, add_special_tokens=False)[num_context_tokens:]))
# ' def foo(x):'
```
However, I didn't attempt to fix it in this PR because the fix seems to belong here, which may have a broader impact. Refer to the reference evaluation below for possible fixes.
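To make the suggested workaround concrete, here is a minimal sketch (my own illustration, not code from the PR) contrasting the two ways of recovering the continuation string on the toy example above; the exact whitespace behaviour depends on the tokenizer and `transformers` version.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

context = "\n"
text = "\n def foo(x):"

ids = tokenizer.encode(text, add_special_tokens=False)
num_context_tokens = len(tokenizer.encode(context, add_special_tokens=False))

# Token-slicing: decode only the continuation token ids. With the Mistral
# tokenizer this can alter the number of leading spaces.
print(repr(tokenizer.decode(ids[num_context_tokens:])))

# String-slicing (what the reference script below does): decode the full
# sequence, then cut the context off as a string.
print(repr(tokenizer.decode(ids)[len(context):]))
```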
Reference evaluation details
Based on the official repo (https://github.com/openai/human-eval), I ran simple model generation with the following script:

```python
import os
import argparse

from transformers import AutoModelForCausalLM, AutoTokenizer
from human_eval.data import write_jsonl, read_problems

STOP_STRINGS = [
    "\nclass",
    "\ndef",
    "\n#",
    "\nif",
    "\nprint",
]


def generate_one_completion(prompt, model, tokenizer):
    """Generate one completion for a given prompt."""
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].cuda()

    # Generate completion
    output = model.generate(
        input_ids=input_ids,
        tokenizer=tokenizer,
        do_sample=False,
        stop_strings=STOP_STRINGS,
        max_new_tokens=1024,
    )
    completion = tokenizer.decode(output[0], skip_special_tokens=True)
    completion = completion[len(prompt):]
    for stop_string in STOP_STRINGS:
        completion = completion.split(stop_string)[0]
    return completion


def main(args):
    model = AutoModelForCausalLM.from_pretrained(args.model, torch_dtype="auto").cuda()
    tokenizer = AutoTokenizer.from_pretrained(args.model)

    problems = read_problems()
    num_samples_per_task = 1
    samples = [
        dict(
            task_id=task_id,
            completion=generate_one_completion(
                problems[task_id]["prompt"],
                model,
                tokenizer,
            ),
        )
        for task_id in problems
        for _ in range(num_samples_per_task)
    ]

    model_name = os.path.basename(args.model)
    write_jsonl(f"{model_name}.jsonl", samples)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, required=True)
    args = parser.parse_args()
    main(args)
```
Then, I evaluated it as follows:
```
evaluate_functional_correctness $MODEL_NAME.jsonl
```
Thanks for adding this (super helpful)! I'm currently running humaneval using your PR.
Any ideas why greedy scores change when using different batch sizes? Seems odd to me, and I'm wondering if it indicates a bug.
It seems to be a more general and known issue (see https://github.com/huggingface/transformers/issues/26869 or https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535), but I'm not certain.
What is the status of this PR? Is there a reason why it hasn't landed?
I just pulled this PR and rebased it against the current main (26f607f5432e1d09c55b25488c43523e7ecde657), and I ran into no issues.
I tested this with `humaneval_greedy` on the model meta-llama/Llama-3.1-8B-Instruct; the results are:
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|----------------|------:|------|-----:|---------|---|-----:|---|-----:|
|humaneval_greedy| 1|n=1 | 0|pass_at_1|↑ |0.6341|± |0.0377|
I think the code is correct. I will run this on more models if you want further testing.
Hi @RawthiL,
I am currently testing HumanEval using GPT-4o-mini as a proof of concept. However, I encountered an issue where the following command results in a 0 value for the pass_at_1 metric. Could you kindly guide me on why this might be happening? I would greatly appreciate your insights. Thanks.
```
lm_eval \
    --model openai-chat-completions \
    --model_args model=gpt-4o-mini,num_concurrent=5 \
    --tasks humaneval_greedy \
    --apply_chat_template
```
Output:
| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
|---|---|---|---|---|---|---|---|---|
| humaneval_greedy | 1 | n=1 | 0 | pass_at_1 | ↑ | 0 | ± | 0 |
Hello! Any progress?
Hi! Sorry this took so long! I just added some confirmation boilerplate to ensure we handle unsafe code safely. Thanks for bearing with me.