AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
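Request: add test results for LLMs released in April-May 2024, such as llama3 and wizardlm. The models on the current leaderboard feel somewhat outdated. Does your team plan to test the latest batch of 2024 models?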
**Is your feature request related to a problem? Please describe.** A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] **Describe the solution you'd...
The correct abbreviation is "kg". It stands for "knowledge graph". The "kd" is a typo.
**Describe the bug** A clear and concise description of what the bug is. Do these images have a problem on Ubuntu? Docker starts, but GPT4/GPT3 all fail: longinyu/agentbench-ltp longinyu/agentbench-mind2web longinyu/agentbench-card_game longinyu/agentbench-alfworld Error for cg as below: {"index": 9, "error": null, "info": null,...
Sorry to raise this problem without a systematic analysis. It may take me more time to complete a fuller investigation of the "compression" ability of LLM as...
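**Describe the bug** When reproducing mind2web (m2w) from AgentBench with gpt-3.5-turbo, I noticed that 35% of the results are unknown, and in `runs.jsonl` these unknown results have no output at all. I first assumed the problem was on my side, but my reproduced score is nearly identical to the one in the paper (20 in the paper, 23 in my run), so this looks like an issue in AgentBench itself. I hope the authors can fix the unknown results. **Screenshots or Terminal Copy&Paste**    **Desktop (please complete the following information):** - OS: windows11 + WSL2(Ubuntu) + Docker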
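One way to check the share of unknown results is to tally a status field across the lines of `runs.jsonl`. The following is only a minimal sketch: it assumes each line is a JSON object with a top-level `status` field, which is a hypothetical name (the real schema may nest it or call it something else), and the sample records are illustrative, not real AgentBench output.

```python
import json
from collections import Counter


def tally_statuses(lines, key="status"):
    """Count how often each value of `key` appears across JSONL records.

    `key` is a hypothetical field name; adjust it to the actual schema.
    Blank lines are skipped; records without the key count as "<missing>".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record.get(key, "<missing>")] += 1
    return counts


def share(counts, value):
    """Return the fraction of records whose status equals `value`."""
    total = sum(counts.values())
    return counts[value] / total if total else 0.0


# Illustrative records only, not real AgentBench output.
sample = [
    '{"index": 0, "status": "completed"}',
    '{"index": 1, "status": "unknown"}',
    '{"index": 2, "status": "completed"}',
    '{"index": 3, "status": "unknown"}',
]
counts = tally_statuses(sample)
unknown_share = share(counts, "unknown")
```

For a real run, an open file handle can be passed in place of `sample`, since `tally_statuses` only iterates over lines.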
**Bug / Assistance Description** The results reported in the HH column are very different from those in the ReAct paper. In particular, ReAct reports **To Reproduce** See screenshots below. Your...
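In the OS test result files, there are almost no results in the "commit" category. If a bash command running to completion is used as the criterion for a correct answer, it is hard to guarantee that the original question was actually answered correctly, as in the case below.  ### Original question As a student, you are given a directory named `log_files` containing log files from multiple servers. The log files are named as "server1.log", "server2.log",...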