llama-models
HumanEval Evaluation Details
Could you share details about the evaluation of code completion tasks such as HumanEval and HumanEval+? In particular, how were the pre-trained models evaluated, and what prompts were used?
I was able to infer the prompt used for the post-trained HumanEval evaluation here, but there are no corresponding results in the evals for the pre-trained models here.
I have used both vLLM and HF to generate outputs greedily and have never been able to reproduce the results stated in the technical report for the pre-trained models. I have also varied the batch size to remove padding, and run inference both with and without padding:
- Llama-3.1-8B [Reported: 37.2 +/- 7.4, Replicated Results: 23.78]
- Llama-3.1-70B [Reported: 58.5 +/- 7.5, Replicated Results: 15.18]
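For reference, the reported numbers with their +/- intervals suggest scoring via the standard unbiased pass@k estimator from the original HumanEval paper; a minimal sketch of that estimator (my own implementation, not code from the llama-models repo) is below, in case the discrepancy comes from the scoring side rather than generation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples generated per problem, c = samples that
    pass the unit tests, k = budget being scored."""
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset
        # must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With greedy decoding n = 1, so pass@1 is simply the fraction
# of problems whose single completion passes:
print(pass_at_k(1, 1, 1))   # problem solved -> 1.0
print(pass_at_k(1, 0, 1))   # problem failed -> 0.0
print(pass_at_k(10, 5, 1))  # 5 of 10 samples pass -> 0.5
```

Note that with temperature-0 greedy decoding there is only one sample per problem, so if the reported intervals come from sampling multiple completions at nonzero temperature, that alone could explain part of the gap.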
More details on this evaluation would be much appreciated. Thank you!