ai-evaluation topic
ai-evaluation repositories
vivaria
53
Stars
15
Forks
Watchers
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
confabulations
236
Stars
7
Forks
236
Watchers
Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.
deception
31
Stars
2
Forks
31
Watchers
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation met...
kereva-scanner
75
Stars
7
Forks
75
Watchers
Code scanner that checks for issues in prompts and LLM calls.