ai-benchmark topic

List ai-benchmark repositories

WindowsAgentArena

805 stars, 88 forks, 805 watchers

Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking multi-modal AI agents.

awesome-ai-agent-testing

21 stars, 4 forks, 21 watchers

🤖 A curated list of resources for testing AI agents: frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems.

TheAgentCompany

612 stars, 98 forks, 612 watchers

An agent benchmark with tasks in a simulated software company.

AI_Diplomacy

611 stars, 85 forks, 611 watchers

Frontier models playing the board game Diplomacy.

agent-leaderboard

205 stars, 23 forks, 205 watchers

Ranking LLMs on agentic tasks.