
UQLM-Ragas integration

Open dylanbouchard opened this issue 4 months ago • 0 comments

Summary

Integrate UQLM’s response-level confidence scoring (bounded from 0 to 1) into Ragas via new metrics that call UQLM's BlackBoxUQ scorers and/or, when token-level logprobs are available, its WhiteBoxUQ scorers. These metrics could support per-scorer confidences and an ensemble confidence, with an optional hallucination-risk view computed as 1 - confidence. This would complement faithfulness and relevancy with a ground-truth-free, model-agnostic signal.

Motivation

  • Many evaluation scenarios lack ground truth or rely on small human-labeled sets. These uncertainty-based metrics help flag risky answers without ground truth.
  • UQLM offers a suite of uncertainty scorers that are production-friendly and model-agnostic. Integrating UQLM would give Ragas users a principled hallucination risk score alongside existing metrics.
  • Users can correlate uncertainty/confidence with faithfulness to triage failure modes and choose mitigation thresholds.

About UQLM

BlackBoxUQ overview

Black-box UQ computes response-level confidence scores by measuring consistency in multiple responses generated from the same prompt. The workflow is displayed below:

Image
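To make the consistency idea concrete, here is a toy sketch in plain Python. It is not UQLM's implementation: the function name `exact_match_confidence` and the normalization rule are assumptions chosen for illustration. It scores a response by the fraction of sampled responses that agree with it after whitespace and case normalization.

```python
def exact_match_confidence(response: str, sampled_responses: list[str]) -> float:
    """Fraction of sampled responses that match the original response.

    Illustrative only: a real scorer would use a more robust notion of
    agreement (for example, an NLI-based contradiction check).
    """
    if not sampled_responses:
        raise ValueError("need at least one sampled response")

    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace so trivial differences do not count.
        return " ".join(text.lower().split())

    target = normalize(response)
    matches = sum(normalize(s) == target for s in sampled_responses)
    return matches / len(sampled_responses)


# Three of the five samples agree with the original answer.
confidence = exact_match_confidence(
    "Paris", ["Paris", "paris ", "Lyon", "Paris", "Marseille"]
)
```

The more the samples disagree with the original response, the lower the score, which is the signal a consistency-based scorer relies on.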

UQLM's black-box workflow can be used in two ways. First, responses can be generated and scored simultaneously:

from uqlm import BlackBoxUQ
bbuq = BlackBoxUQ(llm=llm, scorers=["noncontradiction"])
results = await bbuq.generate_and_score(
    prompts=prompts, # list of prompts
    num_responses=5 # indicates how many sampled responses to generate per prompt
)

Second, pre-generated responses can be scored directly:

bbuq = BlackBoxUQ(scorers=["noncontradiction"])
results = bbuq.score(
    responses=responses, # List of LLM responses
    sampled_responses=sampled_responses,  # List of lists, each with a set of sampled responses to the same prompt
)

Available black-box scorers include:

WhiteBoxUQ overview

White-box UQ computes response-level confidence scores from the token-level probabilities (logprobs) of the generated response, so it requires an LLM that exposes them. The workflow is displayed below:
Image

White-box methods perform generation and scoring in a single call:

from uqlm import WhiteBoxUQ
wbuq = WhiteBoxUQ(llm=llm, scorers=["min_probability"])
results = await wbuq.generate_and_score(prompts=prompts)
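For intuition about what a token-probability scorer computes, the sketch below derives two confidence values from a list of token logprobs. It is an illustration under assumptions, not UQLM's code: the function names are hypothetical, and the input is assumed to be natural-log probabilities, one per generated token.

```python
import math


def min_token_probability(token_logprobs: list[float]) -> float:
    """Probability of the least likely token in the response."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(min(token_logprobs))


def length_normalized_probability(token_logprobs: list[float]) -> float:
    """Geometric mean of the token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Hypothetical logprobs for a three-token response.
logprobs = [math.log(0.9), math.log(0.5), math.log(0.8)]
weakest = min_token_probability(logprobs)
average = length_normalized_probability(logprobs)
```

Both values stay in [0, 1] because logprobs are non-positive. The minimum is sensitive to a single uncertain token, while the geometric mean smooths over the whole response.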

Available white-box scorers include:

Proposed feature

Introduce a new set of metrics (or possibly a single metric), exposed in ragas.metrics.experimental (or similar), which return:

  • uqlm_confidence float in [0, 1], where higher means more likely to be correct, OR uqlm_risk float in [0, 1], where higher means more likely to contain a hallucination (1 - confidence). Perhaps this can be specified by the user.
  • uqlm_high_risk bool when threshold is provided.
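The outputs listed above could be derived from per-scorer confidences roughly as follows. This is a sketch under stated assumptions, not a proposed final API: the `UQLMScore` container and `to_metric_output` helper are hypothetical, the ensemble is taken as an unweighted mean, and the threshold is assumed to apply to the risk value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UQLMScore:
    uqlm_confidence: float           # in [0, 1], higher is better
    uqlm_risk: float                 # 1 - confidence
    uqlm_high_risk: Optional[bool]   # None when no threshold is given


def to_metric_output(
    scorer_confidences: dict[str, float],
    risk_threshold: Optional[float] = None,
) -> UQLMScore:
    """Combine per-scorer confidences into the proposed metric fields."""
    if not scorer_confidences:
        raise ValueError("need at least one scorer confidence")
    for name, value in scorer_confidences.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} confidence {value} is outside [0, 1]")

    # Assumption: unweighted mean as the ensemble confidence.
    confidence = sum(scorer_confidences.values()) / len(scorer_confidences)
    risk = 1.0 - confidence
    # Assumption: an answer is high risk when risk meets or exceeds the threshold.
    high_risk = None if risk_threshold is None else risk >= risk_threshold
    return UQLMScore(confidence, risk, high_risk)


result = to_metric_output(
    {"noncontradiction": 0.9, "exact_match": 0.6}, risk_threshold=0.3
)
```

Returning both views from one object would sidestep the confidence-versus-risk choice, at the cost of a slightly larger result type.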

TBD for reviewer discussion

  • Dependency strategy: direct dependency vs optional extra (for example ragas[uqlm]).
  • Class design: single class for all scorers vs scorer-specific classes vs separate BlackBox and WhiteBox classes. Preference is to keep BlackBox and WhiteBox separate.
  • Naming and default scorer set for the initial release.
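If the optional-extra route is chosen, the metric module could defer the import until the metric is actually used. The helper below is a minimal sketch of that pattern using only the standard library; `require_optional` is a hypothetical name, and the `ragas[uqlm]` extra is the example from the list above, not an existing install target.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import a top-level optional dependency or fail with an install hint."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for this metric. "
            f"Install it with: pip install 'ragas[{extra}]'"
        )
    return importlib.import_module(module_name)


# Inside a metric's constructor one might then call (hypothetical usage):
# uqlm = require_optional("uqlm", extra="uqlm")
```

This keeps the base Ragas install free of the dependency while giving users a clear error message when the extra is missing.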

Checklist

  • [ ] Metric classes (polished and well-tested)
  • [ ] Dependency (optional?)
  • [ ] Documentation page
  • [ ] Example notebook
  • [ ] Unit and integration tests

Additional context

I am the author of UQLM and can help with the API design, implementation, and maintenance on the UQLM side. Happy to open a PR once aligned with Ragas priorities.

dylanbouchard avatar Aug 13 '25 19:08 dylanbouchard