llm-as-a-judge topic

List llm-as-a-judge repositories
agenta

3.3k
Stars
393
Forks

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

CodeUltraFeedback

72
Stars
5
Forks
72
Watchers

CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)

LLM-IR-Bias-Fairness-Survey

58
Stars
3
Forks

This is the repo for the survey of Bias and Fairness in IR with LLMs.

Timo

24
Stars
2
Forks

Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

prometheus-eval

1.0k
Stars
63
Forks

Evaluate your LLM's response with Prometheus and GPT4 💯

cobbler

21
Stars
2
Forks

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

MJ-Bench

49
Stars
5
Forks
49
Watchers

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

agent-as-a-judge

662
Stars
96
Forks

👩‍⚖️ Coding Agent-as-a-Judge

dingo

539
Stars
58
Forks

Dingo: A Comprehensive AI Data Quality Evaluation Tool

verdict

307
Stars
22
Forks

Inference-time scaling for LLMs-as-a-judge.