AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
**Describe the bug** Every run fails when it reaches cg-std#14, and the assigner reports an error > Warning: Qwen2-72B-Instruct/cg-std#14 failed with error INTERACT_FAILED {"detail":"Error: Worker not responding\n"} index=None status= result=None history=None The start_task process prints > except sending except message sent and after the task finishes, the only result is error.jsonl. The error appears to start around this location: https://github.com/THUDM/AgentBench/blob/57b982b10f782661b1346b2234c5ed463f6f85c3/src/server/tasks/card_game/server.py#L38...
Hello, have the trajectories for each task been released, including both the human and the LLM ones? I could not find them in the paper. Thanks.
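I can already run the dbbench task normally, and the results are written to output as expected. How can I reproduce the effect shown in the demo video, where the agent's database operations can be observed in the terminal in real time?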
**Is your feature request related to a problem? Please describe.** A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] **Describe the solution you'd...

{ "index": 298, "error": null, "info": null, "output": { "index": 298, "status": "completed", "result": { "answer": "1049 (42000): Unknown database 'team_stadiums'", "type": "UPDATE", "error": "" }, "history": [ { "role":...
If I want to test models from other Chinese providers, how should I modify the configuration?
A simple strategy to support multiple API keys and distribute calls evenly in http_agent.
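One way such a strategy could look is a thread-safe round-robin pool that hands out keys in turn, so each key receives roughly the same number of calls. The sketch below is only an illustration under that assumption, not the code from this change: `RoundRobinKeyPool` and `build_headers` are hypothetical names, and the `Authorization: Bearer` header format is assumed rather than taken from http_agent.

```python
import itertools
import threading


class RoundRobinKeyPool:
    """Hypothetical helper: cycles through API keys in a fixed order."""

    def __init__(self, keys):
        keys = list(keys)
        if not keys:
            raise ValueError("at least one API key is required")
        self._cycle = itertools.cycle(keys)
        # The lock keeps rotation consistent when several worker threads
        # request keys at the same time.
        self._lock = threading.Lock()

    def next_key(self):
        """Return the next key in rotation."""
        with self._lock:
            return next(self._cycle)


def build_headers(pool):
    """Hypothetical helper: build request headers using the next key.

    The Bearer scheme is an assumption; adapt it to the target API.
    """
    return {"Authorization": f"Bearer {pool.next_key()}"}
```

Each outgoing request would call `build_headers(pool)` once, so with N keys every key is used on every N-th call. Round-robin does not account for per-key rate limits or failures; a key that starts returning errors would still be selected in turn.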
**Is your feature request related to a problem? Please describe.** Docker cannot be used on our cluster. Is it possible to set up the environment locally instead of using Docker?