llm-as-a-judge topic
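The repositories below all build on the same core pattern: prompt one LLM to score another model's output. A minimal sketch of that pattern, where `call_llm` is a hypothetical stand-in for any chat-completion client (stubbed here so the example runs offline):

```python
# Minimal sketch of the LLM-as-a-judge pattern shared by the repos below.
# `call_llm` is a hypothetical placeholder for a real model client.

JUDGE_PROMPT = """You are an impartial judge. Rate the answer to the
question on a 1-5 scale for helpfulness and accuracy.
Question: {question}
Answer: {answer}
Reply with only the integer score."""

def call_llm(prompt: str) -> str:
    # Hypothetical model call; a real client would send `prompt` to an
    # LLM API and return the completion text. Stubbed for illustration.
    return "4"

def judge(question: str, answer: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    score = int(reply.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score

print(judge("What is 2 + 2?", "4"))  # stubbed judge prints 4
```

Most of the projects listed here vary this recipe: different rubrics, different judge models, or checks on the judge's own reliability.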
agenta
The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.
CodeUltraFeedback
CodeUltraFeedback: aligning large language models to coding preferences (TOSEM 2025)
LLM-IR-Bias-Fairness-Survey
This is the repo for the survey of Bias and Fairness in IR with LLMs.
Timo
Code and data for "Timo: Towards Better Temporal Reasoning for Language Models" (COLM 2024)
prometheus-eval
Evaluate your LLM's response with Prometheus and GPT4 💯
cobbler
Code and data for Koo et al.'s ACL 2024 paper "Benchmarking Cognitive Biases in Large Language Models as Evaluators"
MJ-Bench
Official implementation for "MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?"
agent-as-a-judge
👩‍⚖️ Coding Agent-as-a-Judge
dingo
Dingo: A Comprehensive AI Data Quality Evaluation Tool
verdict
Inference-time scaling for LLMs-as-a-judge.
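Inference-time scaling here means spending more compute per verdict, e.g. sampling the judge several times and aggregating. A sketch of one simple aggregation strategy, majority vote over repeated judge samples (the `sample_judge` function is a hypothetical stand-in that simulates a noisy judge deterministically; real systems like verdict explore richer protocols):

```python
from collections import Counter

def sample_judge(question: str, answer: str, sample_idx: int) -> int:
    # Hypothetical noisy judge: a real one would call an LLM with
    # nonzero temperature. Simulated votes stand in for model variance.
    simulated_votes = [4, 5, 4, 4, 3]
    return simulated_votes[sample_idx % len(simulated_votes)]

def scaled_judge(question: str, answer: str, n: int = 5) -> int:
    # Draw n judge samples and return the majority-vote score.
    votes = [sample_judge(question, answer, i) for i in range(n)]
    return Counter(votes).most_common(1)[0][0]

print(scaled_judge("What is 2 + 2?", "4"))  # majority of [4, 5, 4, 4, 3] is 4
```

Spending more samples trades latency and cost for a more stable verdict, which is the basic premise behind inference-time scaling for judges.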