AgentBench
[Bug/Assistance] DBBench evaluation results do not match the leaderboard
I ran the dbbench-std task with 5 workers. The open-source models are all from Hugging Face, deployed with FastChat.
| Model | My score | Leaderboard score |
|---|---|---|
| gpt-3.5-turbo-0613 | 37.667 | 15.00 |
| llama2-13b-chat | 25.00 | 4.50 |
| chatglm3-6b | 34.99 | - |
| chatglm2-6b | - | 13.67 |
Result of gpt-3.5-turbo-0613 (`overall.json`):

```json
{
  "total": 300,
  "validation": {
    "running": 0.0,
    "completed": 0.58,
    "agent context limit": 0.0,
    "agent validation failed": 0.37333333333333335,
    "agent invalid action": 0.0,
    "task limit reached": 0.04666666666666667,
    "unknown": 0.0,
    "task error": 0.0,
    "average_history_length": 7.64,
    "max_history_length": 34,
    "min_history_length": 4
  },
  "custom": {
    "other_accuracy": 0.23529411764705882,
    "counting_accuracy": 0.11764705882352941,
    "comparison_accuracy": 0.17647058823529413,
    "ranking_accuracy": 0.23529411764705882,
    "aggregation-SUM_accuracy": 0.125,
    "aggregation-MIN_accuracy": 0.25,
    "aggregation-MAX_accuracy": 0.0,
    "aggregation-AVG_accuracy": 0.5,
    "SELECT_accuracy": 0.2,
    "INSERT_accuracy": 0.24,
    "UPDATE_accuracy": 0.69,
    "overall_cat_accuracy": 0.37666666666666665
  }
}
```
May I ask whether the final DBBench score is the `overall_cat_accuracy` value in `overall.json`? (According to the paper it is the average of the SELECT/INSERT/UPDATE accuracies, which matches this value.) None of the other metrics line up with the leaderboard either.
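To show the relationship I'm referring to: assuming a plain average over the three SQL categories (my reading of the paper), `overall_cat_accuracy` is reproduced exactly from the per-category values in the `overall.json` above:

```python
# Per-category accuracies copied from the gpt-3.5-turbo-0613 overall.json above.
select_acc = 0.2
insert_acc = 0.24
update_acc = 0.69

# Assumption: overall_cat_accuracy is the unweighted mean of the three categories.
overall = (select_acc + insert_acc + update_acc) / 3
print(round(overall, 5))  # ≈ 0.37667, matching overall_cat_accuracy
```

So my 37.667 is just `overall_cat_accuracy` expressed as a percentage, which is why I assumed it is the reported task score.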
I have also seen similar mismatches on other tasks; for example, the KG score is always 0 (#69).
Is this kind of score mismatch expected? What could be causing it?