AgentBench

[Bug/Assistance] DBbench evaluation results do not match the leaderboard

Open SummerXIATIAN opened this issue 1 year ago • 1 comments

I ran the dbbench-std task with 5 workers. The open-source models all come from Hugging Face and are deployed with FastChat.

| Model | Measured score | Leaderboard score |
| --- | --- | --- |
| gpt-3.5-turbo-0613 | 37.667 | 15.00 |
| llama2-13b-chat | 25.00 | 4.50 |
| chatglm3-6b | 34.99 | - |
| chatglm2-6b | - | 13.67 |

Screenshots of gpt-3.5-turbo-0613 result

<overall.json>

    {
      "total": 300,
      "validation": {
        "running": 0.0,
        "completed": 0.58,
        "agent context limit": 0.0,
        "agent validation failed": 0.37333333333333335,
        "agent invalid action": 0.0,
        "task limit reached": 0.04666666666666667,
        "unknown": 0.0,
        "task error": 0.0,
        "average_history_length": 7.64,
        "max_history_length": 34,
        "min_history_length": 4
      },
      "custom": {
        "other_accuracy": 0.23529411764705882,
        "counting_accuracy": 0.11764705882352941,
        "comparison_accuracy": 0.17647058823529413,
        "ranking_accuracy": 0.23529411764705882,
        "aggregation-SUM_accuracy": 0.125,
        "aggregation-MIN_accuracy": 0.25,
        "aggregation-MAX_accuracy": 0.0,
        "aggregation-AVG_accuracy": 0.5,
        "SELECT_accuracy": 0.2,
        "INSERT_accuracy": 0.24,
        "UPDATE_accuracy": 0.69,
        "overall_cat_accuracy": 0.37666666666666665
      }
    }

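Is the final DB score the overall_cat_accuracy value in overall.json? According to the paper, the score is the average of Select/Insert/Update, which is what this value is. None of the other values come any closer to the leaderboard numbers.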
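As a quick sanity check on that reading, here is a minimal sketch that recomputes the unweighted mean of the three category accuracies from the overall.json values quoted in this issue and compares it with the reported overall_cat_accuracy. The helper name `macro_average` is hypothetical and not part of AgentBench; the equal-weight average is an assumption based on the paper's description, not on the benchmark's scoring code.

```python
import math

# Values copied from the "custom" block of the overall.json quoted in this issue.
custom = {
    "SELECT_accuracy": 0.2,
    "INSERT_accuracy": 0.24,
    "UPDATE_accuracy": 0.69,
    "overall_cat_accuracy": 0.37666666666666665,
}

# The three categories the paper is said to average over.
CATEGORY_KEYS = ("SELECT_accuracy", "INSERT_accuracy", "UPDATE_accuracy")


def macro_average(scores, keys):
    """Unweighted mean of the selected scores (hypothetical helper)."""
    return sum(scores[k] for k in keys) / len(keys)


recomputed = macro_average(custom, CATEGORY_KEYS)
matches = math.isclose(recomputed, custom["overall_cat_accuracy"], rel_tol=1e-9)

print(f"recomputed mean: {recomputed:.5f}")
print(f"matches overall_cat_accuracy: {matches}")
print(f"as a percentage: {100 * recomputed:.3f}")
```

With these numbers the recomputed mean agrees with overall_cat_accuracy, so the 37.667 in the table does look like the Select/Insert/Update average. That suggests the gap to the leaderboard comes from somewhere other than which field is being read.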

I have also seen similar score mismatches on other tasks. For example, the kg score is always 0 (#69).

Is it expected for the evaluation scores to differ from the leaderboard like this, and what could be causing it?

SummerXIATIAN Dec 22 '23 15:12