code-eval
Run evaluations on LLMs using the HumanEval benchmark
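A typical run generates one or more completions per HumanEval problem and then scores them against the benchmark's unit tests. Here is a minimal sketch using OpenAI's `human_eval` package directly; the `generate()` helper is a hypothetical stand-in for whatever model call you use, not part of this repo:

```python
# Minimal sketch of a HumanEval sampling run (human_eval is OpenAI's package).
from human_eval.data import read_problems, write_jsonl

def generate(prompt: str) -> str:
    # Hypothetical: call your LLM here and return the code completion.
    raise NotImplementedError

problems = read_problems()  # 164 HumanEval problems, keyed by task_id
samples = [
    dict(task_id=task_id, completion=generate(problem["prompt"]))
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Then score with: evaluate_functional_correctness samples.jsonl
```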
I'm keeping https://github.com/ErikBjare/are-copilots-local-yet up to date, and would love to see some codellama numbers given that it's now SOTA :)

I got only 9.7% for llama2-7B-chat on HumanEval using your script:

```python
{'pass@1': 0.0975609756097561}
```
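For context, pass@1 is the unbiased estimator from the HumanEval paper (1 - C(n-c, k)/C(n, k)); with a single sample per problem it reduces to the fraction of problems solved, so the figure above corresponds to 16 of the 164 problems. A sketch of the estimator (the `pass_at_k` helper name is illustrative, not code-eval's API):

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k), in a numerically stable product form.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples per problem, c = correct samples, k = sample budget."""
    if n - c < k:
        return 1.0  # fewer failures than the budget: always passes
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With one sample per problem, pass@1 is just the solved fraction:
# 16 solved out of 164 problems gives 16/164 ≈ 0.0976, matching above.
print(pass_at_k(1, 1, 1), 16 / 164)
```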
Suggestion: support https://github.com/THUDM/CodeGeeX2, which was just released; according to the published numbers it reaches a pass@1 of 35.9.