AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
In DBBench, every user reply is "1049 (42000): Unknown database 'xxx'", e.g. {"index": 299, "error": null, "info": null, "output": {"index": 299, "status": "completed", "result": {"answer": "1049 (42000): Unknown database 'Team Information'", "type": "UPDATE", "error": ""}
As shown in the screenshots, I sampled a few examples from the original alfworld train split to use as a train set and configured the related parameters, but every result comes out as unknown. Is there a way to access the alfworld training set? (Likewise, is there a way to access the webshop training set? For example, which parameter in the settings shown in the last screenshot should be changed?)
I looked into one particular DbBench task. GPT4 seems to have given the right answer but MD5 doesn't match. Steps to reproduce the behavior: 1. Run a task with line...
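Hello, while reproducing the benchmark I ran into a problem similar to [Issue 64](https://github.com/THUDM/AgentBench/issues/63). I tried all the tasks, and every task except cg runs normally. Note that the ltp task takes a long time to finish a single sample (about 10 min in my environment), so it is easy to assume that ltp is not running properly either. The ltp task server's backend shows interaction messages throughout, but the cg task server's backend shows none at all. Both tasks produce `Warning: gpt-3.5-turbo-0613/cg-dev#11 failed with error START_FAILED {"detail":"Error: Worker not responding\n"} None ` I am using the ChatGPT 3.5 API with every concurrency value set to 1. Below are my configuration and the related screenshots: ### default.yaml ``` import: definition.yaml concurrency: task: cg-dev: 1 agent: gpt-3.5-turbo-0613: 1 assignments: #...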
I ran the dbbench-std task with 5 workers. The open-source models all come from Huggingface and are deployed with fastchat | Model | Actual score | Leaderboard score | | - | - | - | | gpt-3.5-turbo-0613 | 37.667 | 15.00 | | llama2-13b-chat | 25.00 | 4.50 | |...
**Describe the bug** The official website fails to jump when I switch the link options. **To Reproduce** Steps to reproduce the behavior: 1. Go to https://llmbench.ai/safety/data 2. Click on AgentBench...
Hi, I have counted the number of data samples or problems in the 'os_interaction' folder, and my count shows a total of 191 samples. However, the table that provides statistics...
I want to evaluate the [vicuna_7b_v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) with the webshop task, and according to the `configs/agents/fastchat_client.yaml` the agent config is set as follows: ``` module: "src.agents.FastChatAgent" parameters: controller_address: "http://localhost:5000" max_new_tokens: 128...
We select the minimum r such that count of all tokens in (u_0, a_r, u_{r+1}, ..., u_k) is not greater than 3500. ``` 1. Why 3500 rather than some other number? 2....
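For readers unfamiliar with the rule quoted in this issue, here is a minimal sketch of one way such a truncation could be computed. It is not taken from the AgentBench code base. The function names, the whitespace token counter, and the fallback used when nothing fits are all assumptions made for illustration; only the 3500 limit and the shape (u_0, a_r, u_{r+1}, ..., u_k) come from the quoted sentence.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Placeholder tokenizer (assumption): counts whitespace-separated words.
    return len(text.split())


def truncate_history(
    messages: List[str],
    limit: int = 3500,
    counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Keep u_0 plus the longest suffix a_r, u_{r+1}, ..., u_k that fits.

    `messages` is assumed to alternate as [u_0, a_0, u_1, a_1, ..., u_k].
    """
    head, rest = messages[0], messages[1:]
    # rest = [a_0, u_1, a_1, ..., u_k]; agent turn a_r sits at index 2 * r.
    for start in range(0, len(rest), 2):
        kept = [head] + rest[start:]
        if sum(counter(m) for m in kept) <= limit:
            return kept  # smallest r whose suffix fits within the limit
    # Fallback (assumption): keep u_0 and the latest user turn, if any.
    return [head] + rest[-1:]
```

With a small limit the oldest rounds are dropped first, while the initial instruction u_0 is always retained.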
Added starting the container to the failure modes with an error message if either the init or start scripts fail. Before this change there were the following problems: - if...