UQLM-Ragas integration
Summary
Integrate UQLM’s response-level confidence scoring (bounded from 0 to 1) into Ragas via new metrics that call UQLM’s BlackBoxUQ scorers and, when token-level logprobs are available, its WhiteBoxUQ scorers. These metrics could expose per-scorer confidences and an ensemble confidence, with an optional hallucination-risk view computed as 1 - confidence. This would complement faithfulness and relevancy with a model-agnostic signal that requires no ground truth.
Motivation
- Many evaluation scenarios lack ground truth or rely on small human-labeled sets. Uncertainty-based metrics help flag risky answers in exactly those settings.
- UQLM offers a suite of uncertainty scorers that are production-friendly and model-agnostic. Integrating UQLM would give Ragas users a principled hallucination risk score alongside existing metrics.
- Users can correlate uncertainty/confidence with faithfulness to triage failure modes and choose mitigation thresholds.
About UQLM
- Repo: https://github.com/cvs-health/uqlm
- Papers:
  - Approach overview and experiments: https://arxiv.org/abs/2504.19254
  - Software overview: https://arxiv.org/abs/2507.06196
- Highlights: multiple scorer families, simple Python API, supports black-box and white-box signals, ensembles.
BlackBoxUQ overview
Black-box UQ computes response-level confidence scores by measuring consistency across multiple responses generated from the same prompt.
UQLM's black-box workflow can be used in two ways. First, responses can be generated and scored simultaneously:
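from uqlm import BlackBoxUQ

# llm is a LangChain chat model instance
bbuq = BlackBoxUQ(llm=llm, scorers=["noncontradiction"])
# generate_and_score is a coroutine, so call it from an async context (for example, a notebook cell)
results = await bbuq.generate_and_score(
    prompts=prompts,  # list of prompts
    num_responses=5,  # number of sampled responses to generate per prompt
)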
Second, responses can be scored from pre-generated responses:
bbuq = BlackBoxUQ(scorers=["noncontradiction"])
results = bbuq.score(
    responses=responses,  # list of LLM responses, one per prompt
    sampled_responses=sampled_responses,  # list of lists; each inner list holds the sampled responses for one prompt
)
Available black-box scorers include:
- Non-Contradiction Probability (Chen & Mueller, 2023; Lin et al., 2024; Manakul et al., 2023)
- Discrete Semantic Entropy (Farquhar et al., 2024; Bouchard & Chauhan, 2025)
- Exact Match (Cole et al., 2023; Chen & Mueller, 2023)
- BERT-score (Manakul et al., 2023; Zhang et al., 2020)
- Cosine Similarity (Shorinwa et al., 2024; HuggingFace)
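To make the consistency idea concrete, here is a minimal pure-Python sketch of an exact-match style score: the fraction of sampled responses that agree with the original response. It is an illustration only, not UQLM's implementation; the function name `exact_match_confidence` and the whitespace/case normalization are assumptions made for this example.

```python
from typing import List


def _normalize(text: str) -> str:
    # Assumption for this sketch: ignore case and surrounding/repeated whitespace.
    return " ".join(text.lower().split())


def exact_match_confidence(response: str, sampled_responses: List[str]) -> float:
    """Return the fraction of sampled responses that match the original response."""
    if not sampled_responses:
        raise ValueError("need at least one sampled response")
    target = _normalize(response)
    matches = sum(_normalize(sample) == target for sample in sampled_responses)
    return matches / len(sampled_responses)


# Same input shapes as in the score() call above: one response per prompt,
# and one list of sampled responses per prompt.
responses = ["Paris", "42"]
sampled_responses = [
    ["Paris", "paris", "Paris ", "Lyon"],
    ["41", "42", "43", "44"],
]
scores = [
    exact_match_confidence(resp, samples)
    for resp, samples in zip(responses, sampled_responses)
]
print(scores)  # [0.75, 0.25]
```

A high score means the model keeps giving the same answer when resampled; a low score means the answers disagree, which is the signal the black-box scorers treat as uncertainty.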
WhiteBoxUQ overview
White-box UQ computes response-level confidence scores from the token probabilities (logprobs) of the generated response, so it requires an LLM that exposes them.
White-box methods generate and score in a single call:
from uqlm import WhiteBoxUQ
wbuq = WhiteBoxUQ(llm=llm, scorers=["min_probability"])
results = await wbuq.generate_and_score(prompts=prompts)
Available white-box scorers include:
- Minimum token probability (Manakul et al., 2023)
- Length-Normalized Joint Token Probability (Malinin & Gales, 2021)
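As a rough illustration of what these two scorers compute, the sketch below derives both values from a list of token log-probabilities. It is not UQLM's code; the function names are hypothetical, and the length-normalized score is written here as the geometric mean of the token probabilities, which is the usual reading of "length-normalized joint probability".

```python
import math
from typing import Sequence


def min_token_probability(logprobs: Sequence[float]) -> float:
    """Probability of the least likely token in the response."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(min(logprobs))


def length_normalized_probability(logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities: exp(mean of logprobs)."""
    if not logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(logprobs) / len(logprobs))


# Hypothetical logprobs for a three-token response with probabilities 0.9, 0.5, 0.8.
token_logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
min_prob = min_token_probability(token_logprobs)
norm_prob = length_normalized_probability(token_logprobs)
print(round(min_prob, 3), round(norm_prob, 3))
```

Both values fall in [0, 1] because every token probability does, so they can be reported directly as confidences without rescaling.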
Proposed feature
Introduce a new set of metrics (or possibly a single metric), exposed in ragas.metrics.experimental (or similar), that returns:
- uqlm_confidence: float in [0, 1], where higher means more likely to be correct, OR uqlm_risk: float in [0, 1], where higher means more likely to contain a hallucination (1 - confidence). The user could choose which of the two is reported.
- uqlm_high_risk: bool, returned when a threshold is provided.
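The sketch below shows one way the returned fields could fit together. It deliberately avoids any Ragas or UQLM imports: the `UQLMScore` container, the `ensemble_confidence` weighted average, and the choice to apply the threshold to risk rather than confidence are all assumptions for discussion, not a settled design.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UQLMScore:
    uqlm_confidence: float
    uqlm_risk: float
    uqlm_high_risk: Optional[bool]  # None when no threshold is provided


def ensemble_confidence(
    scores: Dict[str, float], weights: Optional[Dict[str, float]] = None
) -> float:
    """Weighted average of per-scorer confidences (equal weights by default)."""
    if not scores:
        raise ValueError("need at least one scorer")
    if weights is None:
        weights = {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[name] * weights[name] for name in scores) / total


def to_metric_output(confidence: float, threshold: Optional[float] = None) -> UQLMScore:
    """Derive risk and the optional high-risk flag from a confidence in [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    risk = 1.0 - confidence
    # Assumption: the threshold is expressed on the risk scale.
    high_risk = None if threshold is None else risk >= threshold
    return UQLMScore(confidence, risk, high_risk)


per_scorer = {"noncontradiction": 0.9, "exact_match": 0.5}
combined = ensemble_confidence(per_scorer)
flagged = to_metric_output(combined, threshold=0.25)
unflagged = to_metric_output(combined)
print(flagged)
```

Keeping the flag as `None` when no threshold is given makes "not evaluated" distinguishable from "evaluated and low risk", which may matter when results are aggregated.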
TBD for reviewer discussion
- Dependency strategy: direct dependency vs optional extra (for example ragas[uqlm]).
- Class design: single class for all scorers vs scorer-specific classes vs separate BlackBox and WhiteBox classes. Preference is to keep BlackBox and WhiteBox separate.
- Naming and default scorer set for the initial release.
Checklist
- [ ] Metric classes (polished and well-tested)
- [ ] Dependency (optional?)
- [ ] Documentation page
- [ ] Example notebook
- [ ] Unit and integration tests
Additional context
I am the author of UQLM and can help with the API design, implementation, and maintenance on the UQLM side. Happy to open a PR once aligned with Ragas priorities.