
Performance Results on HumanEval

Open · htcml opened this issue 2 years ago · 1 comment

I am reading your CodeRL paper. It uses the APPS benchmark to compare performance with Codex. Do you have any comparison results on the HumanEval dataset?

htcml · Feb 17 '23

@htcml thanks for reading the paper.

In our case, the HumanEval dataset would not be the best evaluation benchmark. The reason is that HumanEval is framed as a docstring-to-code task: the function signature and its docstring (in a code comment block) are given, and the model only has to complete the function body. It is ideal for zero-shot evaluation of larger LMs such as CodeGen and Codex.
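For illustration, a HumanEval-style prompt looks roughly like the sketch below (an invented task in the same format, not an actual HumanEval problem): the model sees the signature and docstring and generates only the body.

```python
# Hypothetical prompt in the HumanEval format (an invented task, not an
# actual HumanEval problem). The model is given everything up to and
# including the docstring and must generate the function body.

def count_vowels(text: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in `text`,
    case-insensitively.

    >>> count_vowels("Code")
    2
    """
    # A model completion would go here; a reference body might be:
    return sum(1 for ch in text.lower() if ch in "aeiou")
```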

In our paper, we focus more on generating a program from scratch given a natural language description of the problem.

One workaround is to reformulate HumanEval problems as text-to-code tasks, but the comparison with the current baselines might not be fair.
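As a rough sketch of what that reformulation could look like (a hypothetical prompt format, not something from the paper), the same kind of problem would be posed as a standalone natural-language statement, and the model would have to produce the entire program, similar to the APPS setup:

```python
# Hypothetical text-to-code reformulation of the same task. Instead of a
# signature and docstring, the model would see only a natural-language
# statement such as:
#
#   "Write a program that reads text from standard input and prints the
#    number of vowels (a, e, i, o, u) it contains."
#
# and would have to generate the whole program from scratch.
import sys

def main() -> None:
    text = sys.stdin.read()
    print(sum(1 for ch in text.lower() if ch in "aeiou"))

if __name__ == "__main__":
    main()
```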

henryhungle · Feb 22 '23