UQLM-Ragas integration

Open dylanbouchard opened this issue 4 months ago • 0 comments

Summary

Integrate UQLM's response-level confidence scoring (bounded from 0 to 1) into Ragas via new metrics that call UQLM's BlackBoxUQ scorers and/or WhiteBoxUQ scorers when token-level logprobs are available. These metrics could support per-scorer confidences and an ensemble confidence, with an optional hallucination risk view computed as 1 - confidence. This would complement faithfulness and relevancy by providing a model-agnostic signal that requires no ground truth.
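As a rough illustration of how per-scorer confidences could be combined into a single ensemble confidence, here is a minimal sketch that uses a weighted average. The function name `ensemble_confidence` and the weighting scheme are assumptions made for this example; this is not UQLM's ensemble implementation.

```python
from typing import Dict, Optional


def ensemble_confidence(
    scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-scorer confidences in [0, 1] with a weighted average.

    Hypothetical helper for illustration only. With no weights given,
    every scorer contributes equally.
    """
    if not scores:
        raise ValueError("at least one scorer confidence is required")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[name] * scores[name] for name in scores)
    return weighted_sum / total_weight
```

Because each input lies in [0, 1] and the weights are normalized, the result also stays in [0, 1], so the 1 - confidence risk view remains well defined.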

Motivation

Many evaluation scenarios lack ground truth or rely on small human-labeled sets. These uncertainty-based metrics help flag risky answers without ground truth.
UQLM offers a suite of uncertainty scorers that are production-friendly and model-agnostic. Integrating UQLM would give Ragas users a principled hallucination risk score alongside existing metrics.
Users can correlate uncertainty/confidence with faithfulness to triage failure modes and choose mitigation thresholds.

About UQLM

Repo: https://github.com/cvs-health/uqlm
Papers:
- Approach overview and experiments: https://arxiv.org/abs/2504.19254
- Software overview: https://arxiv.org/abs/2507.06196
Highlights: multiple scorer families, simple Python API, supports black-box and white-box signals, ensembles.

BlackBoxUQ overview

Black-box UQ computes response-level confidence scores by measuring consistency across multiple responses generated from the same prompt. The workflow is displayed below:

UQLM's black-box workflow can be used in two ways. First, responses can be generated and scored simultaneously:

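from uqlm import BlackBoxUQ

# `llm` is assumed to be an already-configured LangChain chat model.
bbuq = BlackBoxUQ(llm=llm, scorers=["noncontradiction"])
# `await` requires an async context (for example, a notebook cell).
results = await bbuq.generate_and_score(
    prompts=prompts,  # list of prompts
    num_responses=5,  # number of sampled responses to generate per prompt
)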

Second, responses can be scored from pre-generated responses:

bbuq = BlackBoxUQ(scorers=["noncontradiction"])
results = bbuq.score(
    responses=responses, # List of LLM responses
    sampled_responses=sampled_responses,  # List of lists, each with a set of sampled responses to the same prompt
)
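For the pre-generated path, the two arguments must stay aligned: one original response per prompt and one list of sampled responses per prompt. The sketch below shows one way evaluation records could be reshaped into that form. The record keys `response` and `sampled_responses` and the helper name `to_uqlm_inputs` are hypothetical and are not part of UQLM or Ragas.

```python
from typing import Dict, List, Sequence, Tuple


def to_uqlm_inputs(
    records: Sequence[Dict[str, object]],
) -> Tuple[List[str], List[List[str]]]:
    """Split records into parallel `responses` and `sampled_responses` lists.

    Hypothetical helper: each record is assumed to hold the original
    response and the extra responses sampled for the same prompt.
    """
    responses: List[str] = []
    sampled_responses: List[List[str]] = []
    for index, record in enumerate(records):
        samples = list(record["sampled_responses"])
        if not samples:
            raise ValueError(f"record {index} has no sampled responses")
        responses.append(str(record["response"]))
        sampled_responses.append([str(sample) for sample in samples])
    return responses, sampled_responses
```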

Available black-box scorers include:

Non-Contradiction Probability (Chen & Mueller, 2023; Lin et al., 2024; Manakul et al., 2023)
Discrete Semantic Entropy (Farquhar et al., 2024; Bouchard & Chauhan, 2025)
Exact Match (Cole et al., 2023; Chen & Mueller, 2023)
BERT-score (Manakul et al., 2023; Zheng et al., 2020)
Cosine Similarity (Shorinwa et al., 2024; HuggingFace)
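To make the consistency idea concrete, the simplest scorer in this list, exact match, can be sketched as the fraction of sampled responses that agree with the original response. This is an illustrative approximation only: the whitespace and case normalization is an assumption of this example, and the code is not UQLM's implementation.

```python
from typing import List


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumption of this sketch)."""
    return " ".join(text.lower().split())


def exact_match_confidence(response: str, sampled_responses: List[str]) -> float:
    """Fraction of sampled responses identical to the original response."""
    if not sampled_responses:
        raise ValueError("at least one sampled response is required")
    target = _normalize(response)
    matches = sum(_normalize(sample) == target for sample in sampled_responses)
    return matches / len(sampled_responses)
```

The other scorers follow the same pattern but replace string equality with a softer agreement measure, such as NLI-based non-contradiction or embedding similarity.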

WhiteBoxUQ overview

White-box UQ computes response-level confidence scores from the token probabilities (logprobs) of the generated response, so it requires an LLM that exposes them. The workflow is displayed below:

White-box methods generate and score in a single call:

from uqlm import WhiteBoxUQ
wbuq = WhiteBoxUQ(llm=llm, scorers=["min_probability"])
results = await wbuq.generate_and_score(prompts=prompts)

Available white-box scorers include:

Minimum token probability (Manakul et al., 2023)
Length-Normalized Joint Token Probability (Malinin & Gales, 2021)
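Both white-box scorers reduce to simple arithmetic over per-token log-probabilities. The following minimal sketch assumes the logprobs for one response are already available as a list of floats; the function names are chosen for this example and the code is not taken from UQLM.

```python
import math
from typing import Sequence


def min_token_probability(logprobs: Sequence[float]) -> float:
    """Probability of the least likely token in the response."""
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return min(math.exp(lp) for lp in logprobs)


def length_normalized_probability(logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities: exp(mean(logprobs)).

    Equivalent to the joint probability raised to the power 1/L,
    where L is the number of tokens.
    """
    if not logprobs:
        raise ValueError("logprobs must not be empty")
    return math.exp(sum(logprobs) / len(logprobs))
```

Length normalization keeps long responses from being penalized merely for containing more tokens, whereas the minimum highlights a single weak token.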

Proposed feature

Introduce a new set of metrics (or possibly a single metric), exposed in ragas.metrics.experimental (or similar), which returns:

uqlm_confidence float in [0, 1], higher means more likely to be correct, OR uqlm_risk float in [0, 1], higher means more likely to contain a hallucination (1 - confidence). Which of the two is returned could be specified by the user.
uqlm_high_risk bool when threshold is provided.
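The relationship between these outputs could look roughly like the sketch below. The `UQLMScore` container, the `to_uqlm_score` helper, and the choice to apply the threshold to risk with `>=` are all assumptions for discussion, not an existing Ragas or UQLM API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UQLMScore:
    """Hypothetical result container for the proposed metric."""

    uqlm_confidence: float  # in [0, 1]; higher means more likely correct
    uqlm_risk: float  # 1 - confidence
    uqlm_high_risk: Optional[bool]  # None when no threshold is provided


def to_uqlm_score(
    confidence: float, risk_threshold: Optional[float] = None
) -> UQLMScore:
    """Derive the risk view and the optional high-risk flag from a confidence."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    risk = 1.0 - confidence
    # Assumption: the threshold is expressed on the risk scale.
    high_risk = None if risk_threshold is None else risk >= risk_threshold
    return UQLMScore(confidence, risk, high_risk)
```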

TBD for reviewer discussion

Dependency strategy: direct dependency vs optional extra (for example ragas[uqlm]).
Class design: single class for all scorers vs scorer-specific classes vs separate BlackBox and WhiteBox classes. Preference is to keep BlackBox and WhiteBox separate.
Naming and default scorer set for the initial release.
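If the optional-extra route is chosen, the metric module could defer the import and fail with an actionable message when UQLM is absent. A minimal sketch, assuming a hypothetical helper named `require_optional` and the `ragas[uqlm]` extra name floated above:

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or raise with an install hint.

    Hypothetical helper; the extra name is only a proposal.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this metric. "
            f"Install it with: pip install 'ragas[{extra}]'"
        ) from exc
```

A metric class would call `require_optional("uqlm", "uqlm")` at construction time, so importing Ragas itself never depends on UQLM being installed.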

Checklist

[ ] Metric classes (polished and well-tested)
[ ] Dependency (optional?)
[ ] Documentation page
[ ] Example notebook
[ ] Unit and integration tests

Additional context

I am the author of UQLM and can help with the API design, implementation, and maintenance on the UQLM side. Happy to open a PR once aligned with Ragas priorities.

Aug 13 '25 19:08 dylanbouchard