Single-line Infilling Results reproduction
Hello,
I am trying to reproduce the infilling results on HumanEval (Table 14, Code Llama 7B SPM, pass@1 = 83%). I am using the single-line benchmark from https://github.com/openai/human-eval-infilling and the code below to generate the samples.
from human_eval_infilling.data import write_jsonl, read_problems
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    CodeLlamaTokenizer,
)
import torch

model_name = "codellama/CodeLlama-7b-hf"
load_in_8bit = False
device_map = "auto"
max_gen_len = 128

problems = read_problems(benchmark_name="single-line")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=load_in_8bit,
    device_map=device_map,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = CodeLlamaTokenizer.from_pretrained(model_name, trust_remote_code=True, suffix_first=True)

def generate_one_completion(pre, suf):
    # <FILL_ME> marks the span to infill; the tokenizer expands it into the infilling sentinel tokens
    prompt = pre + "<FILL_ME>" + suf
    input_ids = tokenizer(prompt, suffix_first=True, return_tensors="pt")["input_ids"].to("cuda")
    generation_tokens = model.generate(
        input_ids,
        max_new_tokens=max_gen_len,
        temperature=0.2,
    )
    # decode only the newly generated tokens
    outputs = tokenizer.batch_decode(generation_tokens[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
    return outputs

num_samples_per_task = 1
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"], problems[task_id]["suffix"]))
    for task_id in tqdm(problems)
    for _ in range(num_samples_per_task)
]
print(len(samples))
write_jsonl("samples_base_pretrained_codellama.jsonl", samples)
Next, I run the following to compute pass@1. I obtain pass@1 = 0.73281, which is much lower than the reported result.
evaluate_infilling_functional_correctness samples_base_pretrained_codellama.jsonl --benchmark_name=single-line
Can you please help with the following:
- Are the benchmarks and prompts for evaluation correct?
- Is any post-processing required on the generated code (e.g. code sanitization)?
- Are there any recommended hyperparameters (e.g. temperature, decoding strategy)?
I used the instruct model and only got {'pass@1': 0.05227492739593417} :(
But when I use the raw <PRE>, <SUF>, <MID> tokens, in my tests it works better than <FILL_ME>.
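For reference, a rough sketch of how the raw-token prompt could be built by hand (the token strings "▁<PRE>", "▁<SUF>", "▁<MID>" are the defaults in transformers' CodeLlamaTokenizer; the PSM ordering here is my reading of the Code Llama infilling format, so double-check against the reference code):

import torch

def build_psm_input_ids(tokenizer, pre, suf):
    # Sketch only: assemble <PRE> prefix <SUF> suffix <MID> by hand instead of
    # relying on <FILL_ME>. Token strings and ordering are assumptions (see above).
    pre_id = tokenizer.convert_tokens_to_ids("▁<PRE>")
    suf_id = tokenizer.convert_tokens_to_ids("▁<SUF>")
    mid_id = tokenizer.convert_tokens_to_ids("▁<MID>")
    prefix_ids = tokenizer(pre, add_special_tokens=False)["input_ids"]
    suffix_ids = tokenizer(suf, add_special_tokens=False)["input_ids"]
    ids = [tokenizer.bos_token_id, pre_id] + prefix_ids + [suf_id] + suffix_ids + [mid_id]
    return torch.tensor([ids])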
Same issue here. I cannot reproduce the infilling results reported in the paper; mine are a bit lower. Any ideas?
Dear @shivamag125 , @timxx and @stgzr, thanks for reporting!
@timxx : The instruction models are not intended to be used for infilling, please use the pretrained models.
@shivamag125 and @stgzr : The hyperparameters (greedy decoding, i.e. temperature=0) are reported in the paper (Table 14). Note that you need to compare to the models with LCFT in the table, since pretrained models without LCFT have not been released. Moreover, a frequent problem for infilling models is knowing where to stop. Our code cuts the generation after the first linebreak in the single-line infilling task.
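For illustration, a minimal sketch of these two changes applied to the generation function from the original post (variable names reused from that script; this is a sketch, not the evaluation code used for the paper):

def generate_one_completion(pre, suf):
    prompt = pre + "<FILL_ME>" + suf
    input_ids = tokenizer(prompt, suffix_first=True, return_tensors="pt")["input_ids"].to("cuda")
    generation_tokens = model.generate(
        input_ids,
        max_new_tokens=max_gen_len,
        do_sample=False,  # greedy decoding (temperature=0), as reported in Table 14
    )
    completion = tokenizer.batch_decode(
        generation_tokens[:, input_ids.shape[1]:], skip_special_tokens=True
    )[0]
    # single-line task: keep only the text up to the first linebreak
    return completion.split("\n")[0]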
Thank you for the detailed reply. I will try to check my implementation. Another question: when should generation stop in the multi-line and random-span tasks? Using <EOT>?
Thanks! Using a stopping condition like \n reproduces the numbers.
For multi-line, other stopping heuristics exist (see TruncationParameters here: https://github.com/Eric-Wallace/codex/blob/main/infill_evaluation.py), but IIRC both https://github.com/bigcode-project/bigcode-evaluation-harness and our internal code use only <EOT> as the stop symbol for multi-line.
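For illustration, a minimal sketch of stopping at <EOT> for the multi-line/random-span tasks, assuming the marker is registered as the token string "▁<EOT>" (the default in transformers' CodeLlamaTokenizer):

# assumption: the EOT marker is stored under the default token string "▁<EOT>"
eot_id = tokenizer.convert_tokens_to_ids("▁<EOT>")

generation_tokens = model.generate(
    input_ids,
    max_new_tokens=max_gen_len,
    do_sample=False,       # greedy decoding
    eos_token_id=eot_id,   # stop as soon as <EOT> is emitted
)
completion = tokenizer.batch_decode(
    generation_tokens[:, input_ids.shape[1]:], skip_special_tokens=True
)[0]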