opencompass
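[Feature] Keep more detailed evaluation information for code-related datasets
Describe the feature
For code-related datasets such as HumanEval, the evaluation should save which samples passed and which failed, along with the actual execution error (the traceback, or at least the error message, ideally pointing to the specific line of code). This makes it easier to track down problems, rule out the influence of post-processing, and pick out trivial errors such as a missing import. A report-style summary broken down by error type would be even better.
Would you like to implement this feature yourself?
- [ ] I would like to implement this feature myself and contribute the code to OpenCompass!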
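To make the request concrete, here is a minimal sketch of the kind of per-sample record and error-type summary described above. It is not OpenCompass code: `run_sample`, `summarize`, the record fields, and the three demo programs are all hypothetical, and the sketch calls `exec` directly with no sandbox or timeout, which a real harness would need.

```python
import traceback
from collections import Counter


def run_sample(task_id, program):
    """Run one generated program (solution plus its tests) and record the outcome.

    Hypothetical helper for illustration; not part of OpenCompass.
    """
    filename = f"<{task_id}>"
    record = {
        "task_id": task_id,
        "passed": False,
        "error_type": None,
        "error_message": None,
        "error_line": None,
        "traceback": None,
    }
    try:
        # Fresh namespace per sample. No sandboxing here: do not run
        # untrusted model output this way outside of a demo.
        exec(compile(program, filename, "exec"), {})
        record["passed"] = True
    except Exception as exc:
        record["error_type"] = type(exc).__name__
        record["error_message"] = str(exc)
        record["traceback"] = traceback.format_exc()
        # Innermost frame that belongs to the generated program itself.
        frames = [
            frame
            for frame in traceback.extract_tb(exc.__traceback__)
            if frame.filename == filename
        ]
        if frames:
            record["error_line"] = frames[-1].lineno
        elif isinstance(exc, SyntaxError):
            # Compile-time errors have no frame in the program; use exc.lineno.
            record["error_line"] = exc.lineno
    return record


def summarize(records):
    """Count failures by exception type (the report-style summary)."""
    counts = Counter(r["error_type"] for r in records if not r["passed"])
    return dict(counts)


# Hypothetical generated programs: one correct, one missing an import,
# one with a logic bug caught by its test.
samples = {
    "demo/0": "def add(a, b):\n    return a + b\nassert add(1, 2) == 3\n",
    "demo/1": "def area(r):\n    return math.pi * r * r\nassert area(1) > 3\n",
    "demo/2": "def add(a, b):\n    return a - b\nassert add(1, 2) == 3\n",
}

records = [run_sample(task_id, program) for task_id, program in samples.items()]
summary = summarize(records)

for r in records:
    print(r["task_id"], r["passed"], r["error_type"], r["error_line"])
print(summary)
```

With records like these, a trivial failure such as the missing `import math` shows up as a `NameError` with a line number, separate from real logic errors that surface as `AssertionError`.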
Thanks for the feature request. We will add this feature to our Q4 backlog. PRs are also welcome! Thanks again.
Have you tested the performance of APIs such as GPT on the HumanEval dataset? If so, how did you test it?
Please check our documentation for more details.
You can currently use --dump-eval-details: https://github.com/open-compass/opencompass/blob/001e77fea236276aa8018b34cd23076145ab1672/run.py#L127
Feel free to re-open if needed.