AgentBench issues

Will llama3, wizardlm2, and other recent models be tested and published on the leaderboard? Request to add test results for llama3, wizardlm, and other large models released in April-May 2024

3

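Request to add test results for llama3, wizardlm, and other large models released in April-May 2024. The models on the current leaderboard feel somewhat outdated. Does your team plan to test the latest batch of 2024 models?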

dercaft

enhancement

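[Feature] How is the score for each task calculated? For example, the OS task only yields an accuracy, but Table 3 in the paper lists a score for each task. I could not find the mapping between the two in the paper. Could you give a hint?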

1

**Is your feature request related to a problem? Please describe.** A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] **Describe the solution you'd...

lonerFarea

enhancement

Fix typo in README.md

The correct abbreviation is "kg". It stands for "knowledge graph". The "kd" is a typo.

petrgazarov

[Bug/Assistance]

**Describe the bug** A clear and concise description of what the bug is. Do these images have a problem on Ubuntu? Docker starts, but every GPT4/GPT3 run fails: longinyu/agentbench-ltp longinyu/agentbench-mind2web longinyu/agentbench-card_game longinyu/agentbench-alfworld Error for cg as below: {"index": 9, "error": null, "info": null,...

ibingzhaoi

bug

help wanted

Add evaluation of Claude3

2

Great work. Anthropic recently released the Claude3 model family. Could you add evaluation results for the Claude3 models?

xqun3

enhancement

Is testing with OpenAI's tool_call interface supported?

1

Maybewuss

enhancement

Excellent Job! Well, no offense, it seems LLM-Bench rather than AgentBench in essence.

1

Sorry to raise the problem without giving a systematic analysis. It may take me more time to complete an investigation of the "compression" ability of LLM as...

Konisberg

enhancement

[Bug/Assistance] What is going on with the unknown results in mind2web?

1

**Describe the bug** While reproducing mind2web (m2w) from AgentBench with gpt-3.5-turbo, I noticed that 35% of the results are unknown, and in `runs.jsonl` these unknown results have no output at all. I first assumed the problem was on my side, but my reproduced score is nearly identical to the one in the paper (20 in the paper, 23 in my run), so this is probably an issue in AgentBench itself. I hope the authors can fix the unknown results. **Screenshots or Terminal Copy&Paste** ![image](https://github.com/THUDM/AgentBench/assets/28804414/98e00fe5-4f45-4295-ab7f-20070c38f422) ![image](https://github.com/THUDM/AgentBench/assets/28804414/ebfc8432-4318-4d58-a1b5-1b72dd281cc3) ![image](https://github.com/THUDM/AgentBench/assets/28804414/d8b726f0-6cc5-49f4-834f-7b264099c8ba) **Desktop (please complete the following information):** - OS: Windows 11 + WSL2 (Ubuntu) + Docker

Tangent-90C

bug

help wanted

[Bug/Assistance] - Reproducing Results on Alfworld (HH) (vs. ReAct paper)

4

**Bug / Assistance Description** The results that are reported in the HH column are very different to the ReAct paper. In particular, ReAct reports **To Reproduce** See screenshots below. Your...

ai-nikolai

bug

help wanted

OS std test set results

1

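In the OS test result files, almost none of the results fall into the "commit" category. If a bash command running to completion is taken as the criterion for a correct answer, it is hard to guarantee that the original question was actually answered correctly, as in the case below. ![image](https://github.com/THUDM/AgentBench/assets/9492425/2e43ca02-6dc8-4a71-b5ce-aa5b61e057dc) ### Original question As a student, you are given a directory named `log_files` containing log files from multiple servers. The log files are named as "server1.log", "server2.log",...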

xqun3

bug

help wanted
