ChiYeung Law

39 comments by ChiYeung Law

From your figure, it looks like you evaluated WizardCoder on HumanEval-X, which is different from OpenAI's HumanEval. To reproduce the results, you can follow [this](https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder#how-to-reproduce-the-598-pass1-on-humaneval-with-greedy-decoding). Or you can try...
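For reference, a minimal sketch of what greedy decoding looks like with `transformers`; the checkpoint name and prompt here are assumptions for illustration, so follow the linked instructions for the exact setup:

```python
# Minimal sketch of greedy decoding with Hugging Face transformers.
# The model ID and prompt below are assumptions, not the official recipe;
# see the linked reproduction instructions for the exact configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WizardLM/WizardCoder-15B-V1.0"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that returns the sum of a list."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=False takes the argmax token at every step, i.e. greedy decoding.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```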

We have run three experiments:

1. n=1, greedy decoding: pass@1 on HumanEval is 59.8
2. n=20, temperature=0.2, top_p=0.95: pass@1 on HumanEval is 57.3
3. n=200, temperature=0.2, top_p=0.95: pass@1 on HumanEval...
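For context, when n > 1 samples are drawn per problem, pass@k is typically computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021), pass@k = E[1 - C(n-c, k) / C(n, k)], where c is the number of samples that pass the unit tests. A sketch:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
# pass@k = E[1 - C(n - c, k) / C(n, k)], where n is the number of samples
# per problem and c is the number of samples that pass the unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples with c correct."""
    if n - c < k:
        return 1.0
    # Numerically stable product form of 1 - C(n-c, k) / C(n, k).
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples for one problem, 11 of which pass the tests.
print(pass_at_k(n=20, c=11, k=1))  # 0.55, i.e. c / n for k=1
```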

The token limit is the same as StarCoder's (8192).
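A quick way to confirm the context window is to read it from the model config; a sketch, assuming the checkpoint exposes it as `n_positions` (GPT-2-style configs) or `max_position_embeddings` (most others):

```python
# Sketch: read the context window from the model config. The attribute
# name varies by architecture, so check both common spellings.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bigcode/starcoder")
limit = getattr(config, "n_positions", None) or getattr(
    config, "max_position_embeddings", None
)
print(limit)  # expected: 8192
```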

A WizardLM based on Llama 2 has already been released. ![image](https://github.com/nlpxucan/WizardLM/assets/31592607/e11556a6-b7e5-469f-a237-a886d2281e8d)

Thank you for your suggestions. We will read this paper.

I think something has gone wrong on your side, but I cannot figure out what from your config. We checked the generated results on HumanEval with n=20, and they are not the same.
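One way to pin down the discrepancy is to diff the two sets of generations problem by problem. A sketch, assuming both sides saved completions as JSONL with `task_id` and `completion` fields (those field names are assumptions; adjust them to your output format):

```python
# Sketch: compare two HumanEval generation files problem by problem.
# Assumes JSONL records with "task_id" and "completion" fields.
import json
from collections import defaultdict

def load_generations(path: str) -> dict:
    by_task = defaultdict(list)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            by_task[record["task_id"]].append(record["completion"])
    return by_task

ours = load_generations("ours_n20.jsonl")      # hypothetical file names
theirs = load_generations("theirs_n20.jsonl")

for task_id in sorted(ours):
    if ours[task_id] != theirs.get(task_id):
        print(f"{task_id}: generations differ")
```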

![dd2a8500f82222b3e475485f987ff49](https://github.com/nlpxucan/WizardLM/assets/31592607/f59f3021-310b-4b21-8f44-ed534694d263) The 43.6 score is evaluated on Google's MBPP (500 problems), and our WizardCoder is evaluated on the same data. The 52.7 score is evaluated on [MultiPL-E's MBPP (397 problems)](https://huggingface.co/datasets/bigcode/MultiPL-E-completions).
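For clarity, Google's MBPP is available on the Hugging Face Hub and its test split holds exactly the 500 problems referenced above; a sketch with `datasets`:

```python
# Sketch: Google's MBPP as distributed on the Hugging Face Hub. The test
# split has the 500 problems referenced above. MultiPL-E's MBPP is a
# separate 397-problem derivation (see the linked dataset card).
from datasets import load_dataset

google_mbpp = load_dataset("mbpp", split="test")
print(len(google_mbpp))  # 500
```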

We use the same prompt as the Eval Harness when evaluating StarCoder on MBPP.
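To my understanding, the harness's MBPP prompt wraps the natural-language description and the first test case in a docstring. A sketch of that format, as an approximation only; check the harness source for the exact template:

```python
# Sketch of an MBPP prompt in the style used by the evaluation harness:
# the problem description plus the first unit test, wrapped in a docstring.
# This is an approximation for illustration, not the harness's verbatim code.
def build_mbpp_prompt(doc: dict) -> str:
    description = doc["text"]           # natural-language problem statement
    test_example = doc["test_list"][0]  # first unit test, shown as a hint
    return f'"""\n{description}\n{test_example}\n"""\n'

# Example record in the shape of an MBPP problem.
doc = {
    "text": "Write a function to find the shared elements from the given two lists.",
    "test_list": [
        "assert similar_elements((3, 4, 5, 6), (5, 7, 4, 10)) == (4, 5)"
    ],
}
print(build_mbpp_prompt(doc))
```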