AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

Results 46 AgentBench issues

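Request to add test results for large models released in April-May 2024, such as Llama 3 and WizardLM. The models on the current leaderboard feel somewhat outdated. Does your team plan to test the latest batch of 2024 models?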

enhancement

**Is your feature request related to a problem? Please describe.** A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] **Describe the solution you'd...

enhancement

The correct abbreviation is "kg". It stands for "knowledge graph". The "kd" is a typo.

**Describe the bug** A clear and concise description of what the bug is. Do the following images have problems on Ubuntu? Docker starts, but GPT4/GPT3 fail on all of them: longinyu/agentbench-ltp longinyu/agentbench-mind2web longinyu/agentbench-card_game longinyu/agentbench-alfworld Error for cg as below: {"index": 9, "error": null, "info": null,...

bug
help wanted

Great work. Anthropic recently released the Claude3 model series; could you add evaluation results for the Claude3 models?

enhancement

Sorry to raise this problem without giving a systematic analysis. It may take me more time to complete a fuller investigation of the "compression" ability of LLM as...

enhancement

**Describe the bug** While reproducing mind2web (m2w) from AgentBench with gpt-3.5-turbo, I noticed that 35% of the results are unknown, and in `runs.jsonl` these unknown results have no output at all. I first assumed the problem was on my side, but my reproduced score is nearly identical to the one in the paper (20 in the paper, 23 in my run), so this looks like an issue in AgentBench itself. I hope the authors can fix these unknown results. **Screenshots or Terminal Copy&Paste** ![image](https://github.com/THUDM/AgentBench/assets/28804414/98e00fe5-4f45-4295-ab7f-20070c38f422) ![image](https://github.com/THUDM/AgentBench/assets/28804414/ebfc8432-4318-4d58-a1b5-1b72dd281cc3) ![image](https://github.com/THUDM/AgentBench/assets/28804414/d8b726f0-6cc5-49f4-834f-7b264099c8ba) **Desktop (please complete the following information):** - OS: Windows 11 + WSL2 (Ubuntu) + Docker

bug
help wanted
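
For the mind2web report above, the share of unknown results in a JSONL run log can be tallied with a short script. The following is a minimal sketch, not AgentBench tooling: the `status` field name and its placement (top level or under `output`) are assumptions, since the issue does not show the record layout, and the sample records are invented for illustration.

```python
import json
from collections import Counter


def tally_statuses(lines, field="status"):
    """Count values of `field` across JSONL records.

    The field name and its location are assumptions; adjust them to
    match the actual records in your run log.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed layout: status is at the top level or nested under "output".
        output = record.get("output")
        if not isinstance(output, dict):
            output = {}
        counts[record.get(field, output.get(field, "missing"))] += 1
    return counts


def share(counts, status):
    """Fraction of records with the given status (0.0 for an empty log)."""
    total = sum(counts.values())
    return counts[status] / total if total else 0.0


# Hypothetical records, for illustration only.
sample = [
    '{"index": 0, "output": {"status": "completed"}}',
    '{"index": 1, "output": {"status": "unknown"}}',
    '{"index": 2, "output": null}',
]
counts = tally_statuses(sample)
print(dict(counts))
print(f"unknown share: {share(counts, 'unknown'):.0%}")
```

To run it on a real log, pass an open file handle instead of the sample list, for example `tally_statuses(open("runs.jsonl", encoding="utf-8"))`. Records that carry no status at all are counted as `missing`, which separates empty outputs from results explicitly marked unknown.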

**Bug / Assistance Description** The results reported in the HH column are very different from those in the ReAct paper. In particular, ReAct reports **To Reproduce** See screenshots below. Your...

bug
help wanted

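In the result files for the os task, there are almost no results in the "commit" category. If a bash command running to completion is taken as the criterion for a correct answer, it is hard to guarantee that the original question was actually answered correctly, as in the case below. ![image](https://github.com/THUDM/AgentBench/assets/9492425/2e43ca02-6dc8-4a71-b5ce-aa5b61e057dc) ### Original question As a student, you are given a directory named `log_files` containing log files from multiple servers. The log files are named as "server1.log", "server2.log",...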

bug
help wanted