ai-evaluation topic

List ai-evaluation repositories

vivaria

53
Stars
15
Forks
Watchers

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

confabulations

236
Stars
7
Forks
236
Watchers

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

deception

31
Stars
2
Forks
31
Watchers

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation met...

kereva-scanner

75
Stars
7
Forks
75
Watchers

Code scanner to check for issues in prompts and LLM calls