Request for CodeLlama's Specific Generation Parameters on the HumanEval Dataset
Dear Maintainer,
I hope this message finds you well. I have been trying to reproduce CodeLlama's performance on the HumanEval dataset as reported in the paper. However, despite my best efforts, I have been unable to achieve the state-of-the-art (SOTA) accuracy it reports.
To further investigate and understand the differences in results, I would greatly appreciate it if you could provide the specific generation parameters used to evaluate CodeLlama on the HumanEval dataset. Having access to these parameters would allow me to align my implementation more closely with the original work.
I understand and respect any concerns about confidentiality or limitations on sharing proprietary information. If it is not possible to disclose the exact production parameters, I would greatly appreciate any guidance or insights you can provide to help improve the accuracy of my implementation.
This is my evaluation result, and here is the code I used for generation:
Thank you for your time and consideration. I look forward to your response and any assistance you can provide.
Hello, I'm also attempting to reproduce the results of Codellama on Humaneval. Could you please clarify how you input the prompts from Humaneval into the model? Do you input the original prompts directly from Humaneval, or do you use any other templates?
thx
I made modifications to this code repository. https://github.com/abacaj/code-eval
Thank you very much!
Hey @xxw11, in the paper we present results on HumanEval pass@1 using greedy decoding; for pass@10 and pass@100 we used a temperature of 0.8. It seems like you used a temperature of 0.25. Can you please try the above-mentioned setup?
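In transformers terms, the two setups above roughly correspond to the following minimal sketch, assuming the codellama/CodeLlama-7b-hf checkpoint on Hugging Face (this is not the official evaluation code, and the prompt is a stand-in for a raw HumanEval prompt):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "def fibonacci(n):\n"  # stand-in for a raw HumanEval prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# pass@1: greedy decoding, one deterministic completion per problem
greedy_out = model.generate(**inputs, do_sample=False, max_new_tokens=256)

# pass@10 / pass@100: sampling at temperature 0.8 (top_p 0.95 is taken from the
# commands and replies later in this thread), several completions per problem
sampled_out = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    num_return_sequences=10,
    max_new_tokens=256,
)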
Hello, I tested the script you provided, but I found that many of the answers are empty. Have you encountered this issue? I hope to receive your response.
It might be worth verifying if there are any issues with the loading process of your model and tokenizer.
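For example, a quick sanity check along these lines (an illustrative sketch assuming the codellama/CodeLlama-7b-hf checkpoint, not code from the code-eval repo) would confirm that a single prompt yields a non-empty completion before running the full generation loop:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "def add(a, b):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=False, max_new_tokens=64)

# Decode only the newly generated tokens; an empty string here points to a
# loading or post-processing issue rather than a benchmark issue.
completion = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(repr(completion))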
Hi @itaigat ,
I've been trying to reproduce the CodeLlama-7b pass scores on HumanEval reported in the paper. I used bigcode-evaluation-harness for evaluation on the HumanEval task, with evaluate's code_eval as the metric. The model I used is from HuggingFace. I used greedy decoding for the pass@1 score. However, the pass@1 score I got in my local environment (29.9%) differs from the score in the paper (33.5%), and the pass@10 score I got locally (57.9%) also differs from the paper's 59.6%.
Below is my code for evaluation:
git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
cd bigcode-evaluation-harness
pip install -e .
pass@1:
# run under bigcode-evaluation-harness/
python main.py \
--model codellama/CodeLlama-7b-hf \
--tasks humaneval \
--do_sample False \
--n_samples 1 \
--allow_code_execution \
--save_generations
pass@10:
python main.py \
--model codellama/CodeLlama-7b-hf \
--tasks humaneval \
--temperature 0.8 \
--n_samples 10 \
--top_p 0.95 \
--allow_code_execution \
--save_generations
I wonder if I left something out, and I would be grateful for any suggestion or guidance on the commands above to reach the same scores as in the paper. It would also be helpful to know whether the evaluation was run on GPU or CPU.
Thanks!
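Since the comment above mentions scoring with evaluate's code_eval metric, here is a minimal, self-contained sketch of that scoring path (illustrative only; the harness wires this up internally, and the problem below is a toy stand-in for a HumanEval task):

import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # code_eval executes generated code; explicit opt-in required

from evaluate import load

code_eval = load("code_eval")

# One problem with two candidate completions (prompt + generated body concatenated).
references = ["assert add(2, 3) == 5"]
predictions = [[
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
]]

pass_at_k, results = code_eval.compute(
    references=references, predictions=predictions, k=[1, 2]
)
print(pass_at_k)  # e.g. {'pass@1': 0.5, 'pass@2': 1.0}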
When using bigcode-evaluation-harness I'd suggest evaluating on the humaneval-unstripped task, which corresponds to the formatting we used for the numbers in the paper. For codellama/CodeLlama-7b-hf, I get 31.1% with the harness and greedy decoding on humaneval-unstripped.
This is still 2 percentage points worse than the 33.5% we got internally. I would attribute the remaining gap to slight differences in inference engines. Tiny differences in computations can accumulate, in particular for smaller, less accurate models. E.g., a model might be relatively unsure at specific prediction steps (as in, the difference between the predicted token and the next most-likely token is very small), and since future tokens are conditioned on past tokens, we can quickly end up with different outputs.
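To make the near-tie point concrete, here is a toy illustration with made-up logits (not actual CodeLlama values) showing how a tiny numerical difference between engines can flip a greedy choice:

import torch

# Two candidate next tokens with almost identical logits (a "relatively unsure" prediction).
logits_engine_a = torch.tensor([3.1415, 3.1417])
# A second engine computes the same step with a tiny numerical difference.
logits_engine_b = logits_engine_a + torch.tensor([2e-4, -2e-4])

print(torch.argmax(logits_engine_a).item())  # 1
print(torch.argmax(logits_engine_b).item())  # 0: a different token, so all later tokens diverge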
I looked at the results we got for HumanEval internally and with the harness for greedy decoding with CodeLlama-7b. There are 44 examples out of the 164 for which the output differs, and the result (wrt all tests passing) is different for 6 examples. In 5 cases, we see correct solutions whereas the harness produced failing code, and in 1 case our prediction fails whereas the solution with the harness passes. Hence we get 4 more examples correct, which corresponds to 2.4% absolute on HumanEval.
That being said, deltas due to inference engine differences should even out when computing pass@k or likely with stronger models. For example, computing pass@k on humaneval-unstripped from 200 samples with temperature=0.2 and top_p=0.95, with the harness I get: pass@1: 30.3; pass@10: 45.8; pass@100: 58.6. Internally, we get pass@1: 30.7; pass@10: 47.2; pass@100: 58.8 (compare with Fig. 6 in https://arxiv.org/abs/2308.12950).
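For completeness, pass@k numbers like the ones above are computed with the standard unbiased estimator from the Codex paper (Chen et al., 2021). A minimal sketch of the per-problem formula, assuming n samples of which c pass:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased per-problem estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Toy example with 200 samples per problem, 60 of them passing;
# pass@1 here is exactly 1 - 140/200 = 0.30.
print(pass_at_k(200, 60, 1), pass_at_k(200, 60, 10), pass_at_k(200, 60, 100))

The benchmark score is this quantity averaged over the 164 HumanEval problems.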