AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
If the agent issues a command like 'while true; do ls /root; sleep 1; done', it loops forever while continuously producing output (meaning the socket never times out)...
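One possible way to guard against this, sketched here under assumptions rather than taken from the AgentBench code, is to enforce a total wall-clock deadline in addition to the per-read idle timeout, so that a command that keeps printing cannot keep the read loop alive indefinitely. The names `read_with_deadline`, `deadline_s`, and `idle_timeout_s` are hypothetical, and the demo uses a local `socket.socketpair()` in place of the real container socket.

```python
import socket
import threading
import time


def read_with_deadline(sock, deadline_s, idle_timeout_s=0.2, chunk_size=4096):
    """Read until the peer goes quiet, closes, or the total deadline passes.

    Returns (data, timed_out). timed_out is True only when the overall
    deadline was hit, i.e. the peer was still producing output.
    """
    end = time.monotonic() + deadline_s
    chunks = []
    while True:
        remaining = end - time.monotonic()
        if remaining <= 0:
            return b"".join(chunks), True
        # Wait at most the idle timeout, but never past the total deadline.
        idle_limited = idle_timeout_s <= remaining
        sock.settimeout(idle_timeout_s if idle_limited else remaining)
        try:
            data = sock.recv(chunk_size)
        except socket.timeout:
            # Quiet for a full idle period: treat the command as finished.
            # Otherwise the wait was cut short by the total deadline.
            return b"".join(chunks), not idle_limited
        if not data:  # peer closed the connection
            return b"".join(chunks), False
        chunks.append(data)


def _chatty_writer(sock, stop):
    # Stand-in for 'while true; do ls /root; sleep 1; done'.
    while not stop.is_set():
        try:
            sock.sendall(b"file_a file_b\n")
        except OSError:
            break
        time.sleep(0.05)


def demo():
    reader, writer = socket.socketpair()
    stop = threading.Event()
    thread = threading.Thread(target=_chatty_writer, args=(writer, stop), daemon=True)
    thread.start()
    try:
        return read_with_deadline(reader, deadline_s=0.5)
    finally:
        stop.set()
        thread.join()
        reader.close()
        writer.close()
```

With only an idle timeout, the loop in `demo()` would never return, because the writer sends output more often than the idle period. The total deadline bounds the read regardless of how chatty the command is.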
Currently the output parsing from the terminal stops when it first sees an escape symbol; however, it appends the whole packet received from the socket, which does not necessarily correspond...
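As a rough illustration of the behaviour this report seems to be asking for (the issue text is truncated, so the intent is an assumption), the sketch below appends only the bytes before the first escape byte instead of the whole packet. The function name `collect_until_escape` and the use of `\x1b` as the escape symbol are hypothetical and not taken from the AgentBench source.

```python
ESC = b"\x1b"  # assumed escape symbol (start of an ANSI escape sequence)


def collect_until_escape(packets):
    """Concatenate packets, stopping exactly at the first escape byte.

    Bytes that arrive in the same packet after the escape byte are dropped
    rather than appended along with the rest of the packet.
    """
    out = bytearray()
    for packet in packets:
        idx = packet.find(ESC)
        if idx == -1:
            out += packet  # no escape byte yet: keep the whole packet
            continue
        out += packet[:idx]  # keep only the part before the escape byte
        break
    return bytes(out)
```

For example, `collect_until_escape([b"hello ", b"world\x1b[0m$ extra"])` keeps `b"hello world"` and drops the prompt bytes that shared a packet with the escape sequence.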
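Hello, I ran into a problem with DBbench-std: it cannot connect to MySQL, while the other tasks run normally. I installed the required dependencies and ran docker pull mysql, and everything went fine. Only running python -m src.start_task -a reports an error. dbbench-std and os-std can both execute, but the dbbench-std results are abnormal (all of them are {"role": "agent", "content": "Action: Answer\nFinal Answer: []"}). Thank you. Error details: INFO: Started server process [738313] INFO: Waiting for application startup. INFO: Application...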
Hi there, Thank you for the great contributions! There have been many new models released since the benchmark was published. Do you have any plans to include some of these...
This issue is related to a [previous one](https://github.com/THUDM/AgentBench/issues/29). For the Knowledge Graph task, the agent seems to be providing the correct answer but the feedback insists that the answer is...
**Describe the bug** Could you please upload the dockerfile? That would mean a lot! **To Reproduce** None **Screenshots or Terminal Copy&Paste** None **Desktop (please complete the following information):** None **Additional...
**Describe the bug** A large number of the os-std tasks in 7/bootstrap.json are impossible for the agents to complete, as they refer to a "given folder" which is at...
Hello, I am using fastchat to load the chatglm3-6b model: step1 `python3 -m fastchat.serve.controller` step2 `python3 -m fastchat.serve.model_worker --model-path /ldata/llms/chatglm3-6b` step3 `python3 -m fastchat.serve.openai_api_server --host 10.0.1.227 --port 30008` After starting the services, I modified the fs_agents.yaml file so that it contains ``` default: module: "src.client.agents.FastChatAgent" parameters: name: "FastChat" controller_address: "http://10.0.1.227:30008" max_new_tokens:...
The reference answer doesn't follow the description