
Why is tokenizer.pad_token == args.pad (i.e., 50256)?

Open 9yte opened this issue 2 years ago • 0 comments

Hi,

For my project, I'm fine-tuning CodeGen models on my own dataset and evaluating the resulting models on the HumanEval benchmark. I have two questions that I would appreciate your help with.

  1. In the sampling code, line 234 has tokenizer.pad_token == args.pad, where args.pad is 50256. Shouldn't pad_token be set to eos_token (the token string) rather than to 50256 (which is the eos_token_id)? Line 240 then passes pad_token_id=args.pad, so in your sampling code both pad_token and pad_token_id end up as 50256. Could you please explain why? That would be super helpful.

  2. As a baseline, I need to replicate your single-turn HumanEval results, but my numbers are noticeably lower than those reported in the paper. I'm fairly sure I'm missing something. To produce the Table 1 results in the paper, did you use exactly the same sampling procedure as sample.py?
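
To make the distinction in question 1 concrete, here is a minimal sketch in plain Python. It is not the repository's code. It assumes a GPT-2-style vocabulary in which the string `<|endoftext|>` maps to id 50256 (the value given above); the toy vocabulary, `token_to_id`, and `left_pad` are hypothetical names used only for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: GPT-2-style end-of-text token and its id, as quoted in the question.
EOS_TOKEN = "<|endoftext|>"   # the token *string*
EOS_TOKEN_ID = 50256          # the token *id*

# Hypothetical toy vocabulary; the small ids are made up for illustration.
TOY_VOCAB: Dict[str, int] = {"def": 1, "add": 2, "return": 3, EOS_TOKEN: EOS_TOKEN_ID}


def token_to_id(token: str) -> int:
    """Look up the integer id of a token string in the toy vocabulary."""
    return TOY_VOCAB[token]


def left_pad(batch: List[List[int]], pad_token_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad every sequence to the longest length and build an attention mask.

    Mask entries are 0 for padding positions and 1 for real tokens.
    """
    width = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = width - len(seq)
        input_ids.append([pad_token_id] * n_pad + seq)
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


# What the question expects: pad_token is a string, pad_token_id is derived from it.
pad_token = EOS_TOKEN
pad_token_id = token_to_id(pad_token)

batch = [[1, 2], [1, 2, 3]]
input_ids, attention_mask = left_pad(batch, pad_token_id)
print(input_ids)
print(attention_mask)
```

In this sketch the pad token is a string and only the pad id is 50256; the question is why the sampling code appears to use 50256 for both.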

Thanks a lot for your time.

9yte, May 02 '23 18:05