agent-evaluation topic

Repositories tagged agent-evaluation

ai-agents-reality-check

Stars: 51 · Forks: 0 · Watchers: 51

Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible...
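The two statistics named above can be computed in a few lines. Cohen's h measures the effect size between two success rates as h = 2·asin(√p1) − 2·asin(√p2). A 95% confidence interval for a single success rate can use z = 1.96. The sketch below is an illustration only and does not reproduce the repository's code. The Wilson score interval is our own choice here; the benchmark may use a normal approximation or bootstrap instead. The function names `cohens_h` and `wilson_ci` are hypothetical.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size between two proportions (hypothetical helper)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate; z=1.96 gives ~95%.

    Assumption: this interval method is our choice, not necessarily the benchmark's.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical task counts, not results from the benchmark.
    agent_ok, wrapper_ok, n = 78, 41, 100
    print("h =", cohens_h(agent_ok / n, wrapper_ok / n))
    print("agent 95% CI:", wilson_ci(agent_ok, n))
    print("wrapper 95% CI:", wilson_ci(wrapper_ok, n))
```

By convention, |h| of about 0.2, 0.5, and 0.8 reads as a small, medium, and large effect. Non-overlapping intervals are a stricter signal than a large h alone.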

coze-loop

Stars: 5.2k · Forks: 705 · Watchers: 5.2k

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to...

awesome-ai-agent-testing

Stars: 21 · Forks: 4 · Watchers: 21

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

eval-view

Stars: 22 · Forks: 3 · Watchers: 22

EvalView: pytest-style test harness for AI agents - YAML scenarios, tool-call checks, cost/latency & safety evals, CI-friendly reports
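A "tool-call check" usually means asserting which tools an agent invoked, and in what order, during one scenario. The sketch below shows that idea as a plain pytest-style test function. It is not EvalView's API. EvalView describes its scenarios in YAML, and its real schema is not shown here. The names `ToolCall`, `called_in_order`, and `forbidden_calls` are hypothetical, as is the shape of the captured trace.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One tool invocation captured from an agent run (hypothetical trace format)."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def called_in_order(trace: list[ToolCall], expected: list[str]) -> bool:
    """True if the expected tool names occur in the trace as an ordered subsequence."""
    names = iter(call.name for call in trace)
    # `in` advances the shared iterator, so each match must come after the previous one.
    return all(name in names for name in expected)


def forbidden_calls(trace: list[ToolCall], forbidden: set[str]) -> list[ToolCall]:
    """Return every call to a tool the scenario says must never be used."""
    return [call for call in trace if call.name in forbidden]


def test_refund_scenario():
    # Hypothetical trace; a real harness would record this from the agent under test.
    trace = [
        ToolCall("lookup_order", {"order_id": "A1"}),
        ToolCall("issue_refund", {"amount": 20}),
    ]
    assert called_in_order(trace, ["lookup_order", "issue_refund"])
    assert not forbidden_calls(trace, {"delete_account"})
```

Checking for an ordered subsequence, not an exact sequence, tolerates extra harmless calls such as retries. A stricter harness could compare the full list instead.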

agent-leaderboard

Stars: 205 · Forks: 23 · Watchers: 205

Ranking LLMs on agentic tasks

Learn how to observe, manage, and scale agentic AI apps using Azure AI Foundry with this hands-on workshop