awesome-ai-eval
awesome-ai-eval copied to clipboard
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
Awesome AI Eval 
A curated list of tools, methods & platforms for evaluating AI quality in real applications.
A curated list of tools, frameworks, benchmarks, and observability platforms for evaluating LLMs, RAG pipelines, and autonomous agents to minimize hallucinations & evaluate practical performance in real production environments.
Contents
- Tools
- Evaluators and Test Harnesses
- RAG and Retrieval
- Prompt Evaluation & Safety
- Datasets and Methodology
- Platforms
- Open Source Platforms
- Hosted Platforms
- Cloud Platforms
- Benchmarks
- General
- Domain
- Agent
- Safety
- Leaderboards
- Resources
- Guides & Training
- Examples
- Related Collections
- Licensing
Tools
Evaluators and Test Harnesses
Core Frameworks
- Anthropic Model Evals
- Anthropic's evaluation suite for safety, capabilities, and alignment testing of language models.
- ColossalEval
- Unified pipeline for classic metrics plus GPT-assisted scoring across public datasets.
- DeepEval
- Python unit-test style metrics for hallucination, relevance, toxicity, and bias.
- Hugging Face lighteval
- Toolkit powering HF leaderboards with 1k+ tasks and pluggable metrics.
- Inspect AI
- UK AI Safety Institute framework for scripted eval plans, tool calls, and model-graded rubrics.
- MLflow Evaluators
- Eval API that logs LLM scores next to classic experiment tracking runs.
- OpenAI Evals
- Reference harness plus registry spanning reasoning, extraction, and safety evals.
- OpenCompass
- Research harness with CascadeEvaluator, CompassRank syncing, and LLM-as-judge utilities.
- Prompt Flow
- Flow builder with built-in evaluation DAGs, dataset runners, and CI hooks.
- Promptfoo
- Local-first CLI and dashboard for evaluating prompts, RAG flows, and agents with cost tracking and regression detection.
- Ragas
- Evaluation library that grades answers, context, and grounding with pluggable scorers.
- TruLens
- Feedback function framework for chains and agents with customizable judge models.
- W&B Weave Evaluations
- Managed evaluation orchestrator with dataset versioning and dashboards.
- ZenML
- Pipeline framework that bakes evaluation steps and guardrail metrics into LLM workflows.
Application and Agent Harnesses
- Braintrust
- Hosted evaluation workspace with CI-style regression tests, agent sandboxes, and token cost tracking.
- LangSmith
- Hosted tracing plus datasets, batched evals, and regression gating for LangChain apps.
- W&B Prompt Registry
- Prompt evaluation templates with reproducible scoring and reviews.
RAG and Retrieval
RAG Frameworks
- EvalScope RAG
- Guides and templates that extend Ragas-style metrics with domain rubrics.
- LlamaIndex Evaluation
- Modules for replaying queries, scoring retrievers, and comparing query engines.
- Open RAG Eval
- Vectara harness with pluggable datasets for comparing retrievers and prompts.
- RAGEval
- Framework that auto-generates corpora, questions, and RAG rubrics for completeness.
- R-Eval
- Toolkit for robust RAG scoring aligned with the Evaluation of RAG survey taxonomy.
Retrieval Benchmarks
- BEIR
- Benchmark suite covering dense, sparse, and hybrid retrieval tasks.
- ColBERT
- Late-interaction dense retriever with evaluation scripts for IR datasets.
- MTEB
- Embeddings benchmark measuring retrieval, reranking, and similarity quality.
RAG Datasets and Surveys
- Awesome-RAG-Evaluation
- Curated catalog of RAG evaluation metrics, datasets, and leaderboards.
- Comparing LLMs on Real-World Retrieval
- Empirical analysis of how language models perform on practical retrieval tasks.
- RAG Evaluation Survey
- Comprehensive paper covering metrics, judgments, and open problems for RAG.
- RAGTruth
- Human-annotated dataset for measuring hallucinations and faithfulness in RAG answers.
Prompt Evaluation & Safety
- AlpacaEval
- Automated instruction-following evaluator with length-controlled LLM judge scoring.
- ChainForge
- Visual IDE for comparing prompts, sampling models, and scoring batches with rubrics.
- Guardrails AI
- Declarative validation framework that enforces schemas, correction chains, and judgments.
- Lakera Guard
- Hosted prompt security platform with red-team datasets for jailbreak and injection testing.
- PromptBench
- Benchmark suite for adversarial prompt stress tests across diverse tasks.
- Red Teaming Handbook
- Microsoft playbook for adversarial prompt testing and mitigation patterns.
Datasets and Methodology
- Deepchecks Evaluation Playbook
- Survey of evaluation metrics, failure modes, and platform comparisons.
- HELM
- Holistic Evaluation of Language Models methodology emphasizing multi-criteria scoring.
- Instruction-Following Evaluation (IFEval)
- Constraint-verification prompts for automatically checking instruction compliance.
- OpenAI Cookbook Evals
- Practical notebooks showing how to build custom evals.
- Safety Evaluation Guides
- Cloud vendor recipes for testing quality, safety, and risk.
- Who Validates the Validators?
- EvalGen workflow aligning LLM judges with human rubrics via mixed-initiative criteria design.
- ZenML Evaluation Playbook
- Playbook for embedding eval gates into pipelines and deployments.
Platforms
Open Source Platforms
- Agenta
- End-to-end LLM developer platform for prompt engineering, evaluation, and deployment.
- Arize Phoenix
- OpenTelemetry-native observability and evaluation toolkit for RAG, LLMs, and agents.
- DocETL
- ETL system for complex document processing with LLMs and built-in quality checks.
- Giskard
- Testing framework for ML models with vulnerability scanning and LLM-specific detectors.
- Helicone
- Open-source LLM observability platform with cost tracking, caching, and evaluation tools.
- Langfuse
- Open-source LLM engineering platform providing tracing, eval dashboards, and prompt analytics.
- Lilac
- Data curation tool for exploring and enriching datasets with semantic search and clustering.
- LiteLLM
- Unified API for 100+ LLM providers with cost tracking, fallbacks, and load balancing.
- Lunary
- Production toolkit for LLM apps with tracing, prompt management, and evaluation pipelines.
- Mirascope
- Python toolkit for building LLM applications with structured outputs and evaluation utilities.
- OpenLIT
- Telemetry instrumentation for LLM apps with built-in quality metrics and guardrail hooks.
- OpenLLMetry
- OpenTelemetry instrumentation for LLM traces that feed any backend or custom eval logic.
- Opik
- Self-hostable evaluation and observability hub with datasets, scoring jobs, and interactive traces.
- Rhesis
- Collaborative testing platform with automated test generation and multi-turn conversation simulation for LLM and agentic applications.
- traceAI
- Open-source multi-modal tracing and diagnostics framework for LLM, RAG, and agent workflows built on OpenTelemetry.
- UpTrain
- OSS/hosted evaluation suite with 20+ checks, RCA tooling, and LlamaIndex integrations.
- VoltAgent
- TypeScript agent framework paired with VoltOps for trace inspection and regression testing.
- Zeno
- Data-centric evaluation UI for slicing failures, comparing prompts, and debugging retrieval quality.
Hosted Platforms
- ChatIntel
- Conversation analytics platform for evaluating chatbot quality, sentiment, and user satisfaction.
- Confident AI
- DeepEval-backed platform for scheduled eval suites, guardrails, and production monitors.
- Datadog LLM Observability
- Datadog module capturing LLM traces, metrics, and safety signals.
- Deepchecks LLM Evaluation
- Managed eval suites with dataset versioning, dashboards, and alerting.
- Eppo
- Experimentation platform with AI-specific evaluation metrics and statistical rigor for LLM A/B testing.
- Future AGI
- Multi-modal evaluation, simulation, and optimization platform for reliable AI systems across software and hardware.
- Galileo
- Evaluation and data-curation studio with labeling, slicing, and issue triage.
- HoneyHive
- Evaluation and observability platform with prompt versioning, A/B testing, and fine-tuning workflows.
- Humanloop
- Production prompt management with human-in-the-loop evals and annotation queues.
- Maxim AI
- Evaluation and observability platform focusing on agent simulations and monitoring.
- Orq.ai
- LLM operations platform with prompt management, evaluation workflows, and deployment pipelines.
- PostHog LLM Analytics
- Product analytics toolkit extended to track custom LLM events and metrics.
- PromptLayer
- Prompt engineering platform with version control, evaluation tracking, and team collaboration.
Cloud Platforms
- Amazon Bedrock Evaluations
- Managed service for scoring foundation models and RAG pipelines.
- Amazon Bedrock Guardrails
- Safety layer that evaluates prompts and responses for policy compliance.
- Azure AI Foundry Evaluations
- Evaluation flows and risk reports wired into Prompt Flow projects.
- Vertex AI Generative AI Evaluation
- Adaptive rubric-based evaluation for Google and third-party models.
Benchmarks
General
- AGIEval
- Human-centric standardized exams spanning entrance tests, legal, and math scenarios.
- BIG-bench
- Collaborative benchmark probing reasoning, commonsense, and long-tail tasks.
- CommonGen-Eval
- GPT-4 judged CommonGen-lite suite for constrained commonsense text generation.
- DyVal
- Dynamic reasoning benchmark that varies difficulty and graph structure to stress models.
- LM Evaluation Harness
- Standard harness for scoring autoregressive models on dozens of tasks.
- LLM-Uncertainty-Bench
- Adds uncertainty-aware scoring across QA, RC, inference, dialog, and summarization.
- LLMBar
- Meta-eval testing whether LLM judges can spot instruction-following failures.
- LV-Eval
- Long-context suite with five length tiers up to 256K tokens and distraction controls.
- MMLU
- Massive multitask language understanding benchmark for academic and professional subjects.
- MMLU-Pro
- Harder 10-choice extension focused on reasoning-rich, low-leakage questions.
- PertEval
- Knowledge-invariant perturbations to debias multiple-choice accuracy inflation.
Domain
- FinEval
- Chinese financial QA and reasoning benchmark across regulation, accounting, and markets.
- LAiW
- Legal benchmark covering retrieval, foundation inference, and complex case applications in Chinese law.
- HumanEval
- Unit-test-based benchmark for code synthesis and docstring reasoning.
- MATH
- Competition-level math benchmark targeting multi-step symbolic reasoning.
- MBPP
- Mostly Basic Programming Problems benchmark for small coding tasks.
Agent
- AgentBench
- Evaluates LLMs acting as agents across simulated domains like games and coding.
- GAIA
- Tool-use benchmark requiring grounded reasoning with live web access and planning.
- MetaTool Tasks
- Tool-calling benchmark and eval harness for agents built around LLaMA models.
- SuperCLUE-Agent
- Chinese agent eval covering tool use, planning, long/short-term memory, and APIs.
Safety
- AdvBench
- Adversarial prompt benchmark for jailbreak and misuse resistance measurement.
- BBQ
- Bias-sensitive QA sets measuring stereotype reliance and ambiguous cases.
- ToxiGen
- Toxic language generation and classification benchmark for robustness checks.
- TruthfulQA
- Measures factuality and hallucination propensity via adversarially written questions.
Leaderboards
- CompassRank
- OpenCompass leaderboard comparing frontier and research models across multi-domain suites.
- LLM Agents Benchmark Collections
- Aggregated leaderboard comparing multi-agent safety and reliability suites.
- Open LLM Leaderboard
- Hugging Face benchmark board with IFEval, MMLU-Pro, GPQA, and more.
- OpenAI Evals Registry
- Community suites and scores covering accuracy, safety, and instruction following.
- Scale SEAL Leaderboard
- Expert-rated leaderboard covering reasoning, coding, and safety via SEAL evaluations.
Resources
Guides & Training
- AI Evals for Engineers & PMs
- Cohort course from Hamel & Shreya with lifetime reader, Discord, AI Eval Assistant, and live office hours.
- AlignEval
- Eugene Yan's guide on building LLM judges by following methodical alignment processes.
- Applied LLMs
- Practical lessons from a year of building with LLMs, emphasizing evaluation as a core practice.
- Data Flywheels for LLM Applications
- Iterative data improvement processes for building better LLM systems.
- Error Analysis & Prioritizing Next Steps
- Andrew Ng walkthrough showing how to slice traces and focus eval work via classic ML techniques.
- Error Analysis Before Tests
- Office hours notes on why error analysis should precede writing automated tests.
- Eval Tools Comparison
- Detailed comparison of evaluation tools including Braintrust, LangSmith, and Promptfoo.
- Evals for AI Engineers
- O'Reilly book by Shreya Shankar & Hamel Husain on systematic error analysis, evaluation pipelines, and LLM-as-a-judge.
- Evaluating RAG Systems
- Practical guidance on RAG evaluation covering retrieval quality and generation assessment.
- Field Guide to Rapidly Improving AI Products
- Comprehensive guide on error analysis, data viewers, and systematic improvement from 30+ implementations.
- Inspect AI Deep Dive
- Technical deep dive into Inspect AI framework with hands-on examples.
- LLM Evals FAQ
- Comprehensive FAQ with 45+ articles covering evaluation questions from practitioners.
- LLM Evaluators Survey
- Survey of LLM-as-judge use cases and approaches with practical implementation patterns.
- LLM-as-a-Judge Guide
- In-depth guide on using LLMs as judges for automated evaluation with calibration tips.
- Mastering LLMs Open Course
- Free 40+ hour course covering evals, RAG, and fine-tuning taught by 25+ industry practitioners.
- Modern IR Evals For RAG
- Why traditional IR evals are insufficient for RAG, covering BEIR and modern approaches.
- Multi-Turn Chat Evals
- Strategies for evaluating multi-turn conversational AI systems.
- Open Source LLM Tools Comparison
- PostHog comparison of open-source LLM observability and evaluation tools.
- Scoping LLM Evals
- Case study on managing evaluation complexity through proper scoping and topic distribution.
- Why AI evals are the hottest new skill
- Lenny's interview covering error analysis, axial coding, eval prompts, and PRD alignment.
- Your AI Product Needs Evals
- Foundational article on why every AI product needs systematic evaluation.
Examples
- Arize Phoenix AI Chatbot
- Next.js chatbot with Phoenix tracing, dataset replays, and evaluation jobs.
- Azure LLM Evaluation Samples
- Prompt Flow and Azure AI Foundry projects demonstrating hosted evals.
- Deepchecks QA over CSV
- Example agent wired to Deepchecks scoring plus tracing dashboards.
- OpenAI Evals Demo Evals
- Templates for extending OpenAI Evals with custom datasets.
- Promptfoo Examples
- Ready-made prompt regression suites for RAG, summarization, and agents.
- ZenML Projects
- End-to-end pipelines showing how to weave evaluation steps into LLMOps stacks.
Related Collections
- Awesome ChainForge
- Ecosystem list centered on ChainForge experiments and extensions.
- Awesome-LLM-Eval
- Cross-lingual (Chinese) compendium of eval tooling, papers, datasets, and leaderboards.
- Awesome LLMOps
- Curated tooling for training, deployment, and monitoring of LLM apps.
- Awesome Machine Learning
- Language-specific ML resources that often host evaluation building blocks.
- Awesome RAG
- Broad coverage of retrieval-augmented generation techniques and tools.
- Awesome Self-Hosted
- Massive catalog of self-hostable software, including observability stacks.
- GenAI Notes
- Continuously updated notes and resources on GenAI systems, evaluation, and operations.
Licensing
Released under the CC0 1.0 Universal license.
Contributing
Contributions are welcome—please read CONTRIBUTING.md for scope, entry rules, and the pull-request checklist before submitting updates.