ChiYeung Law

39 comments by ChiYeung Law

From your figure, it looks like you evaluated WizardCoder on HumanEval-X, which is different from OpenAI's HumanEval. To reproduce the results, you can follow [this](https://github.com/nlpxucan/WizardLM/tree/main/WizardCoder#how-to-reproduce-the-598-pass1-on-humaneval-with-greedy-decoding). Or you can try...
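For reference, a minimal sketch of what greedy decoding looks like with `transformers`; the checkpoint name and prompt here are assumptions for illustration, so follow the linked instructions for the exact setup:

```python
# Minimal sketch of greedy decoding with Hugging Face transformers.
# The model ID and prompt below are assumptions, not the official recipe;
# see the linked reproduction instructions for the exact configuration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WizardLM/WizardCoder-15B-V1.0"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Write a Python function that returns the sum of a list."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# do_sample=False takes the argmax token at every step, i.e. greedy decoding.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```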

We have run three experiments:

1. n=1, greedy decoding: pass@1 on HumanEval is 59.8
2. n=20, temperature=0.2, top_p=0.95: pass@1 on HumanEval is 57.3
3. n=200, temperature=0.2, top_p=0.95: pass@1 on HumanEval...
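For context, when n > 1 samples are drawn per problem, pass@k is typically computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021), pass@k = E[1 - C(n-c, k) / C(n, k)], where c is the number of samples that pass the unit tests. A sketch:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
# pass@k = E[1 - C(n - c, k) / C(n, k)], where n is the number of samples
# per problem and c is the number of samples that pass the unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples with c correct."""
    if n - c < k:
        return 1.0
    # Numerically stable product form of 1 - C(n-c, k) / C(n, k).
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples for one problem, 11 of which pass the tests.
print(pass_at_k(n=20, c=11, k=1))  # 0.55, i.e. c / n for k=1
```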

The token limit is the same as StarCoder's (8192).
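A quick way to confirm the context window is to read it from the model config; a sketch, assuming the checkpoint exposes it as `n_positions` (GPT-2-style configs) or `max_position_embeddings` (most others):

```python
# Sketch: read the context window from the model config. The attribute
# name varies by architecture, so check both common spellings.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("bigcode/starcoder")
limit = getattr(config, "n_positions", None) or getattr(
    config, "max_position_embeddings", None
)
print(limit)  # expected: 8192
```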

A WizardLM based on Llama 2 has already been released. ![image](https://github.com/nlpxucan/WizardLM/assets/31592607/e11556a6-b7e5-469f-a237-a886d2281e8d)

Thank you for your suggestions. We will read this paper.

I think something has gone wrong on your side, but I cannot figure out what from your config. We checked the generated results on HumanEval with n=20, and they are not the same.
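One way to pin down the discrepancy is to diff the two sets of generations problem by problem. A sketch, assuming both sides saved completions as JSONL with `task_id` and `completion` fields (those field names are assumptions; adjust them to your output format):

```python
# Sketch: compare two HumanEval generation files problem by problem.
# Assumes JSONL records with "task_id" and "completion" fields.
import json
from collections import defaultdict

def load_generations(path: str) -> dict:
    by_task = defaultdict(list)
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            by_task[record["task_id"]].append(record["completion"])
    return by_task

ours = load_generations("ours_n20.jsonl")      # hypothetical file names
theirs = load_generations("theirs_n20.jsonl")

for task_id in sorted(ours):
    if ours[task_id] != theirs.get(task_id):
        print(f"{task_id}: generations differ")
```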

![dd2a8500f82222b3e475485f987ff49](https://github.com/nlpxucan/WizardLM/assets/31592607/f59f3021-310b-4b21-8f44-ed534694d263) The 43.6 score is evaluated on Google's MBPP (500 problems), and our WizardCoder is evaluated on the same data. The 52.7 score is evaluated on [MultiPL-E's MBPP (397 problems)](https://huggingface.co/datasets/bigcode/MultiPL-E-completions).
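For clarity, Google's MBPP is available on the Hugging Face Hub and its test split holds exactly the 500 problems referenced above; a sketch with `datasets`:

```python
# Sketch: Google's MBPP as distributed on the Hugging Face Hub. The test
# split has the 500 problems referenced above. MultiPL-E's MBPP is a
# separate 397-problem derivation (see the linked dataset card).
from datasets import load_dataset

google_mbpp = load_dataset("mbpp", split="test")
print(len(google_mbpp))  # 500
```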

We use the same prompt as the Eval Harness when evaluating StarCoder on MBPP.
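To my understanding, the harness's MBPP prompt wraps the natural-language description and the first test case in a docstring. A sketch of that format, as an approximation only; check the harness source for the exact template:

```python
# Sketch of an MBPP prompt in the style used by the evaluation harness:
# the problem description plus the first unit test, wrapped in a docstring.
# This is an approximation for illustration, not the harness's verbatim code.
def build_mbpp_prompt(doc: dict) -> str:
    description = doc["text"]           # natural-language problem statement
    test_example = doc["test_list"][0]  # first unit test, shown as a hint
    return f'"""\n{description}\n{test_example}\n"""\n'

# Example record in the shape of an MBPP problem.
doc = {
    "text": "Write a function to find the shared elements from the given two lists.",
    "test_list": [
        "assert similar_elements((3, 4, 5, 6), (5, 7, 4, 10)) == (4, 5)"
    ],
}
print(build_mbpp_prompt(doc))
```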