Negative values when evaluating context relevance metrics
We encountered an issue where we are getting negative values when evaluating metrics.rag.context_relevance.sentence_bert_mini_lm. The expectation is to get a positive value between 0 and 1.
This was tested using the latest unitxt version 1.22.2.
Attached is a minimal example to reproduce this issue:
import json
from unitxt.eval_utils import evaluate
if __name__ == "__main__":
    data1 = {
        "question": "Why are 'perturbations' simulated?",
        "contexts": [
# "forget their previous instructions. Other sophisticated techniques could include role-playing or red-teaming interactions which can pre-condition a model into naively following harmful instructions.Advanced — more complex attacks can be crafted with specialised encodings and optimised characters including adversarial suffixes which may not have any clear meaning but are still capable of manipulating the model into responding in harmful ways.To check for prompt leakage, IBM watsonx.governance runs your prompt through several different attack scenarios. The responses generated by the model are then compared to the original prompt using a semantic similarity score. The result is a score between 0 and 1, where 0 indicates low risk and 1 indicates high risk of leakage.These tests are crucial for minimizing risks when deploying prompts into production. By computing the metrics, like,“Adversarial Robustness” and “Prompt Leakage Risk”, the LLM Application Developers can assess how susceptible",
"into the vector store.Query the vector store with the user’s question to retrieve the top-k chunks/contexts.Use these contexts and the user’s question to construct a system prompt and run it against an LLM.Finally, retrieve the answer.Something like this:Typical RAG PipelineIn this process, LLM application developers typically face the following challenges:Selecting the LLM: Among different LLMs, determine which one provides responses that result in better RAG metrics, such as Faithfulness, Answer Relevance, and Context Relevance.Defining the system prompt: Does including varied and multiple N-shot examples result in better relevance scores and no hallucinated answers compared to using only a few N-shot examples?Tuning the prompt parameters: For instance, does a prompt temperature value of 0.6 produce better answer relevance scores and fewer hallucinations compared to a temperature value of 0.9?You can imagine various combinations for evaluating which LLM to use, what prompt string to",
# "Prompt Leakage.Prompt Injection AttacksPrompt injection is an attack in which a malicious user tries to modify the input prompt given to an LLM in a such a way that the model either completely ignores the instructions given to it or it misinterprets the instructions given to it. You may have come across news articles about how a car dealership incorporated LLM into their chat bot and a user was able to trick it into selling him a car for $1.Prompt injection is the most significant threat to any Large language model and ranks as the number one vulnerability in OWASP’s top ten for Large language model applications, https://owasp.org/www-project-top-10-for-large-language-model-applications/Prompt Leakage AttacksPrompt Leakage is another threat to large language models where an LLM spits out the original system prompt instruction given to it. Imagine your organization investing considerable time and effort in crafting an effective prompt that efficiently answers user queries on your",
        ],
        "answer": "The provided context does not specifically mention 'perturbations', their purpose, or how they are simulated. Therefore, I cannot provide an answer based on the given information.",
        "ground_truth": "To compute metrics like fairness or explainability around real data points. ",
    }
    result, _ = evaluate(
        [data1],
        metric_names=[
            "metrics.rag.context_relevance.sentence_bert_mini_lm",
        ],
    )
    print(json.dumps(result, indent=2))
Script output:
[
{
"question": "Why are 'perturbations' simulated?",
"contexts": [
"into the vector store.Query the vector store with the user\u2019s question to retrieve the top-k chunks/contexts.Use these contexts and the user\u2019s question to construct a system prompt and run it against an LLM.Finally, retrieve the answer.Something like this:Typical RAG PipelineIn this process, LLM application developers typically face the following challenges:Selecting the LLM: Among different LLMs, determine which one provides responses that result in better RAG metrics, such as Faithfulness, Answer Relevance, and Context Relevance.Defining the system prompt: Does including varied and multiple N-shot examples result in better relevance scores and no hallucinated answers compared to using only a few N-shot examples?Tuning the prompt parameters: For instance, does a prompt temperature value of 0.6 produce better answer relevance scores and fewer hallucinations compared to a temperature value of 0.9?You can imagine various combinations for evaluating which LLM to use, what prompt string to"
],
"answer": "The provided context does not specifically mention 'perturbations', their purpose, or how they are simulated. Therefore, I cannot provide an answer based on the given information.",
"ground_truth": "To compute metrics like fairness or explainability around real data points. ",
"metrics.rag.context_relevance.sentence_bert_mini_lm": -0.06242950260639191
}
]
@assaftibm @lilacheden Do you know why that is?
Thanks @algadhib for providing such a clear way to recreate.
I checked and this occurs in past versions as well (at least 1.14.0), at least with the current models and dependencies, so it's not related to recent Unitxt code changes.
@lilacheden @assaftibm - let's debug this today.
From what I see, the metric does not necessarily return a score between 0 and 1, but rather a score between -1 and 1.
This is because, at its core, it computes cosine similarity (cos_sim) between embedding vectors:
score = util.cos_sim(pred_emb, refs_group_emb).max().item()
which in this case results in the -0.06 score.
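To illustrate, here is a minimal standalone sketch of how such a negative value can arise. This is not the metric's actual code, and the "all-MiniLM-L6-v2" checkpoint name and the truncated context string are assumptions for the example:
from sentence_transformers import SentenceTransformer, util

# Assumed MiniLM checkpoint; the metric's exact model may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

question = "Why are 'perturbations' simulated?"
contexts = ["into the vector store. Query the vector store with the user's question ..."]

# Embed the question and the contexts.
q_emb = model.encode([question], convert_to_tensor=True)
ctx_emb = model.encode(contexts, convert_to_tensor=True)

# Cosine similarity lies in [-1, 1]; unrelated texts can land slightly below 0.
score = util.cos_sim(q_emb, ctx_emb).max().item()
print(score)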
So, at a minimum, we need to document it.
We need to consider whether we want to normalize the score to the range [0, 1] by setting score = (cos_sim_score + 1) / 2.
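As a hypothetical illustration (this helper is not part of unitxt), the proposed mapping would turn the score above into roughly 0.47:
def normalize_cosine(cos_sim_score: float) -> float:
    # Map cosine similarity from [-1, 1] to [0, 1].
    return (cos_sim_score + 1) / 2

print(normalize_cosine(-0.06242950260639191))  # ~0.4688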
Please advise.
Is it ok to just document this behavior (to avoid non-backward-compatible changes)? Users can decide on the needed normalization (@algadhib).
@yoavkatz Are you suggesting that we do the normalization ourselves, or will an option be provided in unitxt to get the normalized score? Can you provide the operation to perform to get the normalized score?
I think that the simplest approach in your code is to take max(0, score). This can be done only for 'metrics.rag.context_relevance.sentence_bert_mini_lm', or even for all scores (assuming negative scores are never desired).
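For example, a user-side sketch of that clipping, applied to the result list returned by the reproduction script above:
# User-side post-processing sketch: clip negative scores to 0.
# Assumes `result` is the list of dicts produced by the reproduction script.
metric = "metrics.rag.context_relevance.sentence_bert_mini_lm"
for row in result:
    row[metric] = max(0.0, row[metric])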
We decided to document the behavior, and not change it. This is to allow the most natural interpretation:
Computes semantic similarity using Sentence-BERT embeddings.
Range: [-1, 1] (higher is better)
Measures cosine similarity between sentence-level embeddings.