AgentBench

A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)

Results 46 AgentBench issues

DBBench: every user reply is "1049 (42000): Unknown database 'xxx'", e.g. {"index": 299, "error": null, "info": null, "output": {"index": 299, "status": "completed", "result": {"answer": "1049 (42000): Unknown database 'Team Information'", "type": "UPDATE", "error": ""}

bug
help wanted

![image](https://github.com/THUDM/AgentBench/assets/77482343/ef449feb-b9a6-4ac2-af47-d47c07f177ad) ![image](https://github.com/THUDM/AgentBench/assets/77482343/1255f420-7bc2-41a9-b77e-8d3b0d8fc5f8) ![image](https://github.com/THUDM/AgentBench/assets/77482343/e51508d5-e30f-4f65-9310-cea6b4b097f1) ![image](https://github.com/THUDM/AgentBench/assets/77482343/1d7465bd-19d1-4910-92bc-cd92b7e86f74) As shown in the screenshots, I sampled a few examples from the original alfworld train split to use as a training set and configured the relevant parameters, but every result comes out as unknown. How can I access the alfworld training set? (Also, is there a way to access the webshop training set? For example, which parameter in the settings shown below should be changed?) ![image](https://github.com/THUDM/AgentBench/assets/77482343/33d4a195-e663-45bb-89af-d48e3c6c0a43)

bug
help wanted

I looked into one particular DbBench task. GPT-4 seems to have given the right answer, but the MD5 doesn't match. Steps to reproduce the behavior: 1. Run a task with line...
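As a generic illustration of how a correct-looking answer can still fail an MD5 comparison, the sketch below hashes two strings that differ only in case and whitespace. It does not reproduce AgentBench's actual evaluation code: `md5_hex`, `normalize`, and the sample strings are hypothetical names and values chosen for this example.

```python
import hashlib


def md5_hex(answer: str) -> str:
    """Return the hex MD5 digest of the UTF-8 bytes of `answer`."""
    return hashlib.md5(answer.encode("utf-8")).hexdigest()


def normalize(answer: str) -> str:
    """Hypothetical normalization: collapse whitespace and lowercase."""
    return " ".join(answer.split()).lower()


# Hypothetical expected and predicted answers that a human would call equal.
expected = "Team Information"
predicted = " team  information\n"

# The raw digests differ because MD5 compares exact bytes.
raw_match = md5_hex(expected) == md5_hex(predicted)

# After the (assumed) normalization, the digests agree.
normalized_match = md5_hex(normalize(expected)) == md5_hex(normalize(predicted))

print(raw_match, normalized_match)
```

Comparing the raw and normalized strings side by side is one way to check whether a mismatch comes from formatting rather than from the answer itself.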

bug
help wanted

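Hello, while reproducing the results I ran into a problem similar to [Issue 64](https://github.com/THUDM/AgentBench/issues/63). I tried all the tasks, and everything except the cg task runs normally. Note that the ltp task takes a long time to finish a single sample (about 10 min in my environment), so it is easy to assume that ltp is not working either. The ltp task server keeps logging interaction messages in the background, whereas the cg task server logs none at all. Both tasks produce `Warning: gpt-3.5-turbo-0613/cg-dev#11 failed with error START_FAILED {"detail":"Error: Worker not responding\n"} None ` I am using the ChatGPT 3.5 API with all concurrency values set to 1. My configuration and related screenshots follow: ### default.yaml ``` import: definition.yaml concurrency: task: cg-dev: 1 agent: gpt-3.5-turbo-0613: 1 assignments: #...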

bug
help wanted

I ran the dbbench-std task with 5 workers. The open-source models all come from Huggingface and are deployed with fastchat.

| Model | Actual score | Leaderboard score |
| - | - | - |
| gpt-3.5-turbo-0613 | 37.667 | 15.00 |
| llama2-13b-chat | 25.00 | 4.50 |

...

bug
help wanted

**Describe the bug** The official website fails to navigate when I switch between the link options. **To Reproduce** Steps to reproduce the behavior: 1. Go to https://llmbench.ai/safety/data 2. Click on AgentBench...

bug
help wanted

Hi, I have counted the number of data samples or problems in the 'os_interaction' folder, and my count shows a total of 191 samples. However, the table that provides statistics...

bug
help wanted

I want to evaluate [vicuna_7b_v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) on the webshop task, and according to `configs/agents/fastchat_client.yaml` the agent config is set as follows: ``` module: "src.agents.FastChatAgent" parameters: controller_address: "http://localhost:5000" max_new_tokens: 128...

We select the minimum r such that count of all tokens in (u0, ar, ur+1, · · · , uk) is not greater than 3500. ``` cn 1. Why 3500 rather than some other number? 2....
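The quoted truncation rule can be sketched as follows. This is a minimal illustration, not AgentBench's implementation: `truncate_history` and `count_tokens_whitespace` are hypothetical names, and whitespace splitting stands in for whatever tokenizer is really used. It assumes the history alternates as u0, a0, u1, ..., uk and that r ranges over the existing agent turns.

```python
from typing import Callable, List, Tuple


def count_tokens_whitespace(text: str) -> int:
    """Stand-in token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


def truncate_history(
    users: List[str],
    agents: List[str],
    budget: int = 3500,
    count_tokens: Callable[[str], int] = count_tokens_whitespace,
) -> Tuple[int, List[str]]:
    """Pick the minimum r so that (u0, a_r, u_{r+1}, ..., u_k) fits the budget.

    users holds u0..uk and agents holds a0..a_{k-1}.
    Returns r and the list of kept messages.
    """
    k = len(users) - 1
    if k < 0 or len(agents) != k:
        raise ValueError("expected users u0..uk and agents a0..a_{k-1}")
    # With k == 0 there is only u0, so a single candidate (r = 0) is checked.
    for r in range(max(k, 1)):
        kept = [users[0]]
        for i in range(r, k):
            kept.append(agents[i])
            kept.append(users[i + 1])
        if sum(count_tokens(m) for m in kept) <= budget:
            return r, kept
    raise ValueError("even the shortest allowed history exceeds the budget")


# Tiny hypothetical dialogue with a small budget so the effect is visible.
users = ["a b c", "d e", "f g"]
agents = ["x y z w", "p q"]
r, kept = truncate_history(users, agents, budget=7)
print(r, kept)
```

Because r is scanned upward from 0, the first candidate that fits keeps the longest suffix of recent turns while always retaining the initial instruction u0.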

Added starting the container to the failure modes with an error message if either the init or start scripts fail. Before this change there were the following problems: - if...