llm-as-a-judge topics

xFinder
176 Stars, 7 Forks
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
IAAR-Shanghai
Topics: evaluation, gpt, llm, xfinder

xVerify
138 Stars, 7 Forks
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
IAAR-Shanghai
Topics: benchmark, cc-by-nc-nd-4, chatgpt, deepseek-math

ineqmath
52 Stars, 7 Forks
Solving Inequality Proofs with Large Language Models.
lupantech
Topics: inequality, llm-as-a-judge, llms, math-reasoning

circle-guard-bench
44 Stars, 2 Forks
First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)
whitecircle-ai
Topics: ai, benchmark, benchmarking, guardrail

docling-sdg
35 Stars, 13 Forks
A set of tools to create synthetically-generated data from documents
docling-project
Topics: ai, documents, llm-as-a-judge, question-answering

OmniVerifier
34 Stars, 3 Forks
Generative Universal Verifier as Multimodal Meta-Reasoner
Cominclip
Topics: llm-as-a-judge, multimodal-large-language-models, multimodal-reasoning, vision-language-model

Themis
20 Stars, 1 Forks
The official repository for our EMNLP 2024 paper, Themis: A Reference-free NLG Evaluation Language Model with Flexibility and Interpretability.
PKU-ONELab
Topics: evaluation, llm-as-a-judge, nlg

CuREV
18 Stars, 3 Forks, 18 Watchers
Harnessing Large Language Models for Curated Code Reviews
OussamaSghaier
Topics: code-review, dataset-curation, empirical-software-engineering, large-language-models