llm-as-a-judge topics

agenta

3.5k

Stars

413

Forks

3.5k

Watchers

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

Agenta-AI

human-annotation

langchain

large-language-models

llama-index

CodeUltraFeedback

72

Stars

5

Forks

72

Watchers

CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)

martin-wey

alignment

codal-bench

code-generation

codeultrafeedback

LLM-IR-Bias-Fairness-Survey

58

Stars

3

Forks

58

Watchers

This is the repo for the survey of Bias and Fairness in IR with LLMs.

KID-22

bias

chatgpt

fairness

information-retrieval

Timo

24

Stars

3

Forks

24

Watchers

Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)

zhaochen0110

colm2024

llm-as-a-judge

llm-as-evaluator

llms

prometheus-eval

1.0k

Stars

66

Forks

1.0k

Watchers

Evaluate your LLM's response with Prometheus and GPT4 💯

prometheus-eval

evaluation

gpt4

litellm

llm

cobbler

21

Stars

2

Forks

21

Watchers

Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"

minnesotanlp

bias

bias-detection

evaluation

llm

MJ-Bench

49

Stars

5

Forks

49

Watchers

Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"

MJ-Bench

llm-as-a-judge

llm-benchmarking

multimodal-foundation-model

multimodal-judge

agent-as-a-judge

662

Stars

96

Forks

👩‍⚖️ Coding Agent-as-a-Judge

metauto-ai

agent-as-a-judge

llm-as-a-judge

llms

dingo

539

Stars

58

Forks

Dingo: A Comprehensive AI Data Quality Evaluation Tool

MigoXLab

common-crawl

data-agent

data-evaluation

data-quality

verdict

307

Stars

22

Forks

Inference-time scaling for LLMs-as-a-judge.

haizelabs

inference-time-compute

llm

llm-as-a-judge

llm-judge