AgentBench
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
**Title**: Revise Prompts to Comply with OpenAI API Policy **Description**: ### Background Recent updates to the OpenAI API have introduced stricter content filtering policies, causing some of our existing prompts...
Hi AgentBench Team, Thanks for your awesome effort in constructing this benchmark. I would like to ask whether you have added, or plan to add, the experimental results of large reasoning models...
I am trying to run webshop-std, but it reports that the task does not exist. May I ask why this happens? My config is as follows:...
In data/os_interaction/data/dev.json, the example code for task "Find out count of linux users on this system who belong to at least 4 groups." is incorrect. The current example checks for...
**Describe the bug** In the code, the following is used to check whether the input string is an entity: ```python...
I want to view the UI shown in the demo video. Does anyone know how I can do this?
Has anyone run into a 100 error on the docker build? ``` docker build -f data/os_interaction/res/dockerfiles/default data/os_interaction/res/dockerfiles --tag local-os/default ``` ``` 1.987 At least one invalid signature was encountered. 2.082 Get:3...