ChiYeung Law
From your figure, I see that you evaluated WizardCoder on HumanEval-X, which is different from OpenAI's HumanEval. To reproduce the results, you can follow [this](https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder#how-to-reproduce-the-598-pass1-on-humaneval-with-greedy-decoding). Or you can try...
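Independently of which route you take, here is a minimal sketch of greedy-decoding generation with Hugging Face `transformers`, just to illustrate the decoding settings involved. The model id and the instruction template below are my assumptions, not the repo's exact script, so please defer to the linked guide for the authoritative prompt and pipeline.

```python
# Sketch only: greedy decoding with an assumed checkpoint id and an assumed
# Alpaca-style instruction template; the linked reproduction guide is authoritative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WizardLM/WizardCoder-15B-V1.0"  # assumed Hugging Face model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Assumed instruction-style prompt; replace with the prompt from the repo.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nCreate a Python script for this problem:\n"
    'def add(a, b):\n    """Return the sum of a and b."""\n\n'
    "### Response:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)  # greedy decoding (n=1)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```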
We have done three experiments:
1. n=1, greedy decoding: pass@1 on HumanEval is 59.8
2. n=20, temperature=0.2, top_p=0.95: pass@1 on HumanEval is 57.3
3. n=200, temperature=0.2, top_p=0.95: pass@1 on HumanEval...
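For context, when n > 1 samples are drawn per problem, pass@1 is usually computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), averaged over problems. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem:
    1 - C(n - c, k) / C(n, k), where n is the number of samples and
    c the number of correct ones, computed as a stable running product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. with n=20 samples per problem, report the mean of pass_at_k(20, c_i, 1)
# over all HumanEval problems, where c_i is the number of passing samples.
```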
The token limit is the same as StarCoder's (8,192 tokens).
WizardLM based on Llama 2 has already been released.
Thank you for your suggestions. We will read this paper.
Have you tried more samples?
I think you have done something wrong, but I cannot figure out what from your config. We checked the generated results on HumanEval with n=20, and they are not the same.
The 43.6 score is evaluated on Google's MBPP (500 problems); our WizardCoder is also evaluated on the same data. The 52.7 score is evaluated on [MultiPL-E's MBPP (397 problems)](https://huggingface.co/datasets/bigcode/MultiPL-E-completions).
We follow the same prompt as the Eval Harness to evaluate StarCoder on MBPP.
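For reference, here is a rough sketch of the kind of zero-shot MBPP prompt we mean (the problem description plus one test case wrapped in a docstring). This is an approximation, not the harness's exact code, so small formatting details may differ.

```python
# Rough approximation of an Eval-Harness-style zero-shot MBPP prompt;
# the exact formatting in the harness may differ slightly.
def build_mbpp_prompt(description: str, test_case: str) -> str:
    return f'"""\n{description}\n{test_case}\n"""\n'

# Hypothetical example problem, just to show the shape of the prompt.
print(build_mbpp_prompt(
    "Write a function to add two numbers.",
    "assert add(2, 3) == 5",
))
```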