ai-benchmark topic
ai-benchmark repositories
WindowsAgentArena
Stars: 805 | Forks: 88 | Watchers: 805
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.
awesome-ai-agent-testing
Stars: 21 | Forks: 4 | Watchers: 21
🤖 A curated list of resources for testing AI agents: frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems.
TheAgentCompany
Stars: 612 | Forks: 98 | Watchers: 612
An agent benchmark with tasks in a simulated software company.
AI_Diplomacy
Stars: 611 | Forks: 85 | Watchers: 611
Frontier models playing the board game Diplomacy.
agent-leaderboard
Stars: 205 | Forks: 23 | Watchers: 205
Ranking LLMs on agentic tasks.