Single-line Infilling Results reproduction
Hello,
I am trying to reproduce the infilling results on HumanEval (Table 14, Code Llama 7B SPM, pass@1 = 83%). I am using the single-line benchmark from https://github.com/openai/human-eval-infilling and the code below to generate the samples.
from human_eval_infilling.data import write_jsonl, read_problems
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    CodeLlamaTokenizer,
)
import torch

model_name = "codellama/CodeLlama-7b-hf"
load_in_8bit = False
device_map = "auto"
max_gen_len = 128

problems = read_problems(benchmark_name="single-line")

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=load_in_8bit,
    device_map=device_map,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = CodeLlamaTokenizer.from_pretrained(model_name, trust_remote_code=True, suffix_first=True)

def generate_one_completion(pre, suf):
    # <FILL_ME> marks the span to infill; the tokenizer expands it into the infilling sentinel tokens
    prompt = pre + "<FILL_ME>" + suf
    input_ids = tokenizer(prompt, suffix_first=True, return_tensors="pt")["input_ids"].to("cuda")
    generation_tokens = model.generate(
        input_ids,
        max_new_tokens=max_gen_len,
        temperature=0.2,
    )
    # decode only the newly generated tokens
    outputs = tokenizer.batch_decode(generation_tokens[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
    return outputs

num_samples_per_task = 1
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"], problems[task_id]["suffix"]))
    for task_id in tqdm(problems)
    for _ in range(num_samples_per_task)
]
print(len(samples))
write_jsonl("samples_base_pretrained_codellama.jsonl", samples)
Next, I run the following to compute pass@1. I obtain pass@1 = 0.73281, which is much lower than the reported result.
evaluate_infilling_functional_correctness samples_base_pretrained_codellama.jsonl --benchmark_name=single-line
Can you please help with the following:
- Are the benchmarks and prompts for evaluation correct?
- Is any post-processing required on the generated code (e.g. code sanitization)?
- Are there any recommended hyperparameters (e.g. temperature, decoding strategy)?
I used the instruct model and only got {'pass@1': 0.05227492739593417} :(
But when I use the raw <PRE>, <SUF>, <MID> tokens, in my tests it works better than <FILL_ME>.
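For reference, a rough sketch of how the raw-token prompt could be built by hand (the token strings "▁<PRE>", "▁<SUF>", "▁<MID>" are the defaults in transformers' CodeLlamaTokenizer; the PSM ordering here is my reading of the Code Llama infilling format, so double-check against the reference code):

import torch

def build_psm_input_ids(tokenizer, pre, suf):
    # Sketch only: assemble <PRE> prefix <SUF> suffix <MID> by hand instead of
    # relying on <FILL_ME>. Token strings and ordering are assumptions (see above).
    pre_id = tokenizer.convert_tokens_to_ids("▁<PRE>")
    suf_id = tokenizer.convert_tokens_to_ids("▁<SUF>")
    mid_id = tokenizer.convert_tokens_to_ids("▁<MID>")
    prefix_ids = tokenizer(pre, add_special_tokens=False)["input_ids"]
    suffix_ids = tokenizer(suf, add_special_tokens=False)["input_ids"]
    ids = [tokenizer.bos_token_id, pre_id] + prefix_ids + [suf_id] + suffix_ids + [mid_id]
    return torch.tensor([ids])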
Same issue here. I cannot reproduce the infilling results reported in the paper; mine are a bit lower. Any ideas?
Dear @shivamag125 , @timxx and @stgzr, thanks for reporting!
@timxx : The instruction models are not intended to be used for infilling, please use the pretrained models.
@shivamag125 and @stgzr : The hyperparameters (greedy decoding, i.e. temperature=0) are reported in the paper (Table 14). Note that you need to compare to the models with LCFT in the table, since pretrained models without LCFT have not been released. Moreover, a frequent problem for infilling models is knowing where to stop. Our code cuts the generation after the first linebreak in the single-line infilling task.
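For illustration, a minimal sketch of these two changes applied to the generation function from the original post (variable names reused from that script; this is a sketch, not the evaluation code used for the paper):

def generate_one_completion(pre, suf):
    prompt = pre + "<FILL_ME>" + suf
    input_ids = tokenizer(prompt, suffix_first=True, return_tensors="pt")["input_ids"].to("cuda")
    generation_tokens = model.generate(
        input_ids,
        max_new_tokens=max_gen_len,
        do_sample=False,  # greedy decoding (temperature=0), as reported in Table 14
    )
    completion = tokenizer.batch_decode(
        generation_tokens[:, input_ids.shape[1]:], skip_special_tokens=True
    )[0]
    # single-line task: keep only the text up to the first linebreak
    return completion.split("\n")[0]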
Thank you for the detailed reply. I will try to check my implementation. Another question: when should generation stop in the multi-line and random-span tasks? Using <EOT>?
Thanks! Using a stopping condition like \n reproduces the numbers.
For multi-line, other stopping heuristics exist (see TruncationParameters here: https://github.com/Eric-Wallace/codex/blob/main/infill_evaluation.py), but IIRC both https://github.com/bigcode-project/bigcode-evaluation-harness and our internal code use only <EOT> as the stop symbol for multi-line.
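For illustration, a minimal sketch of stopping at <EOT> for the multi-line/random-span tasks, assuming the marker is registered as the token string "▁<EOT>" (the default in transformers' CodeLlamaTokenizer):

# assumption: the EOT marker is stored under the default token string "▁<EOT>"
eot_id = tokenizer.convert_tokens_to_ids("▁<EOT>")

generation_tokens = model.generate(
    input_ids,
    max_new_tokens=max_gen_len,
    do_sample=False,       # greedy decoding
    eos_token_id=eot_id,   # stop as soon as <EOT> is emitted
)
completion = tokenizer.batch_decode(
    generation_tokens[:, input_ids.shape[1]:], skip_special_tokens=True
)[0]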