llama-models
HumanEval Evaluation Details
Could you share details about the evaluation of code completion tasks such as HumanEval and HumanEval+? In particular, how were the pre-trained models evaluated, and what prompts were used?
I was able to infer the prompt used for the post-trained HumanEval evaluation here, but there are no corresponding results in the evals for the pre-trained models here.
I have used both vLLM and HF to generate outputs greedily and have never been able to reproduce the results stated in the technical report for the pre-trained models. I have also varied the batch size to remove padding, and run inference both with and without padding:
- Llama-3.1-8B [Reported: 37.2 +/- 7.4, Replicated Results: 23.78]
- Llama-3.1-70B [Reported: 58.5 +/- 7.5, Replicated Results: 15.18]
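For reference, the reported numbers with their +/- intervals suggest scoring via the standard unbiased pass@k estimator from the original HumanEval paper; a minimal sketch of that estimator (my own implementation, not code from the llama-models repo) is below, in case the discrepancy comes from the scoring side rather than generation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples generated per problem, c = samples that
    pass the unit tests, k = budget being scored."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset
        # must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding n = 1, so pass@1 is simply the fraction
# of problems whose single completion passes:
print(pass_at_k(1, 1, 1))   # problem solved -> 1.0
print(pass_at_k(1, 0, 1))   # problem failed -> 0.0
print(pass_at_k(10, 5, 1))  # 5 of 10 samples pass -> 0.5
```

Note that with temperature-0 greedy decoding there is only one sample per problem, so if the reported intervals come from sampling multiple completions at nonzero temperature, that alone could explain part of the gap.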
More details on this evaluation would be much appreciated. Thank you!