ai-evaluation topic

Repositories tagged ai-evaluation

vivaria (53 stars, 15 forks)

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

confabulations (240 stars, 7 forks)

A document-based hallucination (confabulation) benchmark for RAG. Includes human-verified questions and answers.

deception (31 stars, 2 forks)

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

kereva-scanner (76 stars, 7 forks)

Code scanner to check for issues in prompts and LLM calls.

uqlm (1.1k stars, 115 forks)

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.

cookbooks (21 stars, 1 fork)

Example projects integrated with the Future AGI tech stack for easy AI development.

agent-leaderboard (205 stars, 23 forks)

Ranking LLMs on agentic tasks.

awesome-ai-eval (22 stars, 4 forks)

☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.

deepscholar-bench (103 stars, 10 forks)

A benchmark for evaluating generative research synthesis.