opencompass [Feature] Keep more detailed evaluation information when evaluating code-related datasets

Open II-Matto opened this issue 1 year ago • 2 comments

Describe the feature

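For code-related datasets such as HumanEval, it would help if the evaluation saved which samples passed and which failed, together with the concrete execution error (the traceback, or at least the error message, ideally pointing to the specific line of code). This would make it easier to track down problems, rule out the influence of post-processing, and tell apart trivial errors (such as a missing import). A report-style summary broken down by error type would be even better.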
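To make the request concrete, here is a minimal sketch of the kind of per-sample record and error-type summary described above. It is not OpenCompass code: the names `run_sample` and `summarize` and the record fields are hypothetical, and the sketch assumes each sample is a single Python string containing the generated solution followed by its test assertions. It also runs the code in-process with no timeout or sandbox, which a real harness would need.

```python
import traceback
from collections import Counter


def run_sample(task_id, program):
    """Run one generated program (solution plus tests) and record the outcome.

    Hypothetical helper for illustration; not part of OpenCompass.
    """
    filename = f"<{task_id}>"
    record = {"task_id": task_id, "passed": False, "error_type": None,
              "error_message": None, "line": None, "traceback": None}
    try:
        # WARNING: exec runs untrusted code in-process, without a timeout.
        exec(compile(program, filename, "exec"), {"__name__": "__eval__"})
        record["passed"] = True
    except Exception as exc:
        if isinstance(exc, SyntaxError):
            # Compilation failed, so the line comes from the exception itself.
            line = exc.lineno
        else:
            # Deepest traceback frame that belongs to the generated program.
            frames = traceback.extract_tb(exc.__traceback__)
            lines = [f.lineno for f in frames if f.filename == filename]
            line = lines[-1] if lines else None
        record.update(error_type=type(exc).__name__,
                      error_message=str(exc),
                      line=line,
                      traceback=traceback.format_exc())
    return record


def summarize(records):
    """Count passes and group the failures by exception type."""
    failures = Counter(r["error_type"] for r in records if not r["passed"])
    return {"total": len(records),
            "passed": sum(r["passed"] for r in records),
            "errors_by_type": dict(failures)}


if __name__ == "__main__":
    samples = {
        "demo/ok": "def add(a, b):\n    return a + b\n\nassert add(1, 2) == 3\n",
        "demo/missing_import": "def root(x):\n    return math.sqrt(x)\n\nassert root(4) == 2\n",
        "demo/wrong_answer": "def add(a, b):\n    return a - b\n\nassert add(1, 2) == 3\n",
    }
    records = [run_sample(tid, src) for tid, src in samples.items()]
    for r in records:
        print(r["task_id"], r["passed"], r["error_type"], r["line"])
    print(summarize(records))
```

Keeping the exception type separate from the message is what lets trivial failures such as a `NameError` from a missing import be distinguished from an `AssertionError` caused by a wrong answer.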

Would you like to implement this feature yourself?

[ ] I would like to implement this feature myself and contribute code to OpenCompass!

Oct 17 '23 09:10 II-Matto

Thanks for the feature request. We will add this feature to our Q4 backlog. PRs are also welcome! Thanks again.

Oct 17 '23 10:10 tonysy

Have you ever tested the performance of APIs such as GPT on the human eval dataset, and how did you test it?

Nov 30 '23 07:11 ALLISWELL8

Have you ever tested the performance of APIs such as GPT on the human eval dataset, and how did you test it?

Please check our documentation for more details.

Feb 28 '24 14:02 tonysy

You can currently use --dump-eval-details: https://github.com/open-compass/opencompass/blob/001e77fea236276aa8018b34cd23076145ab1672/run.py#L127

Feel free to re-open if needed.

Feb 28 '24 14:02 tonysy
