ai-evaluation topic
vivaria
Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
confabulations
A document-based hallucination (confabulation) benchmark for RAG, with human-verified questions and answers.
deception
Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.
kereva-scanner
Code scanner that checks for issues in prompts and LLM calls
uqlm
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection.
cookbooks
Example projects integrated with the Future AGI tech stack for easy AI development
agent-leaderboard
Ranking LLMs on agentic tasks
awesome-ai-eval
☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
deepscholar-bench
Benchmark for evaluating generative research synthesis