ai-evaluation topic
vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
confabulations
A document-based hallucination (confabulation) benchmark for RAG, with human-verified questions and answers.
deception
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
kereva-scanner
Code scanner that checks for issues in prompts and LLM calls
uqlm
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
cookbooks
Example projects integrated with the Future AGI tech stack for easy AI development
agent-leaderboard
Ranking LLMs on agentic tasks
awesome-ai-eval
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
deepscholar-bench
Benchmark for evaluating generative research synthesis