Answer Relevancy giving the same questions every time
@dosu I've gone through the library. First, the prompt needs something that instructs the model to go through the context, rather than just providing it in the few-shot examples. Second, with your help I streamed the generated questions, and all three questions are exactly the same, and that happens for every row. Maybe that is why the score is inconsistent. Please help me fix it.
To address the issues with the Answer Relevancy feature generating the same questions every time and causing inconsistent scoring, you can consider the following steps:
- **Ensure Randomness in Question Generation**: Modify the `QUESTION_GEN` prompt to include a variety of examples and instructions that encourage diverse question generation.
- **Check Embedding Consistency**: Ensure that the embeddings used for calculating similarity are consistent and correctly implemented.
- **Adjust Strictness Parameter**: The `strictness` parameter controls the number of questions generated per answer. Ensure it is set to an appropriate value (e.g., 3 by default).
Here is a modified version of the relevant parts of the code to address these issues:
```python
QUESTION_GEN = Prompt(
    name="question_generation",
    instruction="""Generate a diverse set of questions for the given answer and identify if the answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers. Ensure the questions are varied and cover different aspects of the context.""",
    output_format_instruction=_output_instructions,
    examples=[
        {
            "answer": """Albert Einstein was born in Germany.""",
            "context": """Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time""",
            "output": AnswerRelevanceClassification.parse_obj(
                {
                    "question": "Where was Albert Einstein born?",
                    "noncommittal": 0,
                }
            ).dict(),
        },
        {
            "answer": """It can change its skin color based on the temperature of its environment.""",
            "context": """A recent scientific study has discovered a new species of frog in the Amazon rainforest that has the unique ability to change its skin color based on the temperature of its environment.""",
            "output": AnswerRelevanceClassification.parse_obj(
                {
                    "question": "What unique ability does the newly discovered species of frog have?",
                    "noncommittal": 0,
                }
            ).dict(),
        },
        {
            "answer": """Everest""",
            "context": """The tallest mountain on Earth, measured from sea level, is a renowned peak located in the Himalayas.""",
            "output": AnswerRelevanceClassification.parse_obj(
                {
                    "question": "What is the tallest mountain on Earth?",
                    "noncommittal": 0,
                }
            ).dict(),
        },
        {
            "answer": """I don't know about the groundbreaking feature of the smartphone invented in 2023 as I am unaware of information beyond 2022.""",
            "context": """In 2023, a groundbreaking invention was announced: a smartphone with a battery life of one month, revolutionizing the way people use mobile technology.""",
            "output": AnswerRelevanceClassification.parse_obj(
                {
                    "question": "What was the groundbreaking feature of the smartphone invented in 2023?",
                    "noncommittal": 1,
                }
            ).dict(),
        },
    ],
    input_keys=["answer", "context"],
    output_key="output",
    output_type="json",
)


@dataclass
class AnswerRelevancy(MetricWithLLM, MetricWithEmbeddings):
    name: str = "answer_relevancy"
    evaluation_mode: EvaluationMode = EvaluationMode.qac
    question_generation: Prompt = field(default_factory=lambda: QUESTION_GEN)
    strictness: int = 3  # Adjust this value as needed

    def calculate_similarity(
        self: t.Self, question: str, generated_questions: list[str]
    ):
        assert self.embeddings is not None
        question_vec = np.asarray(self.embeddings.embed_query(question)).reshape(1, -1)
        gen_question_vec = np.asarray(
            self.embeddings.embed_documents(generated_questions)
        ).reshape(len(generated_questions), -1)
        norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(
            question_vec, axis=1
        )
        return (
            np.dot(gen_question_vec, question_vec.T).reshape(
                -1,
            )
            / norm
        )

    def _calculate_score(
        self, answers: t.Sequence[AnswerRelevanceClassification], row: t.Dict
    ) -> float:
        question = row["question"]
        gen_questions = [answer.question for answer in answers]
        committal = np.any([answer.noncommittal for answer in answers])
        if all(q == "" for q in gen_questions):
            logger.warning(
                "Invalid JSON response. Expected dictionary with key 'question'"
            )
            score = np.nan
        else:
            cosine_sim = self.calculate_similarity(question, gen_questions)
            score = cosine_sim.mean() * int(not committal)
        return score

    def _create_question_gen_prompt(self, row: t.Dict) -> PromptValue:
        ans, ctx = row["answer"], row["contexts"]
        return self.question_generation.format(answer=ans, context="\n".join(ctx))

    async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
        assert self.llm is not None, "LLM is not set"
        prompt = self._create_question_gen_prompt(row)
        result = await self.llm.generate(
            prompt,
            n=self.strictness,
            callbacks=callbacks,
        )
        answers = [
            await _output_parser.aparse(result.text, prompt, self.llm)
            for result in result.generations[0]
        ]
        if any(answer is None for answer in answers):
            return np.nan
        answers = [answer for answer in answers if answer is not None]
        return self._calculate_score(answers, row)
```
These changes should help generate more diverse questions and produce more consistent scores. For context, the Answer Relevancy metric is defined as the mean cosine similarity of the original question to a number of artificial questions generated from the answer: if the generated answer accurately addresses the initial question, the LLM should be able to reconstruct questions from the answer that align with the original question [1][2].
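As a toy illustration of that scoring rule (made-up vectors standing in for real embeddings, not the library's own code):

```python
import numpy as np

# hypothetical toy embeddings for the original question and three generated questions
question_vec = np.array([[0.9, 0.1, 0.0]])
gen_question_vecs = np.array([
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.9, 0.0, 0.1],
])

# cosine similarity of each generated question to the original question
norm = np.linalg.norm(gen_question_vecs, axis=1) * np.linalg.norm(question_vec, axis=1)
cosine_sim = np.dot(gen_question_vecs, question_vec.T).reshape(-1) / norm

noncommittal = False  # any noncommittal generation zeroes out the score
score = cosine_sim.mean() * int(not noncommittal)
print(round(float(score), 3))
```

Note that if any of the generated outputs is flagged noncommittal, the whole score collapses to 0, independent of the cosine similarities.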
@dosu I made the changes, but I still get the same set of questions. What's the point of strictness?
The `strictness` parameter in Ragas controls the number of questions generated per answer, typically set between 2 and 4. It helps maintain a certain level of self-consistency in predictions, ensuring that the evaluations are consistent and reliable [1].
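For example, assuming the `AnswerRelevancy` dataclass shown above is exposed via `ragas.metrics` (an assumption about the import path), raising `strictness` is just a constructor argument; note it only controls how many completions are sampled, not how different they are:

```python
from ragas.metrics import AnswerRelevancy

# sketch: sample 4 generated questions per answer instead of the default 3;
# strictness is passed as n to llm.generate, so it does not force distinct outputs
answer_relevancy = AnswerRelevancy(strictness=4)
```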
If you are still getting the same set of questions despite adjusting the strictness parameter, you might need to ensure that the prompts and examples used for question generation are diverse and comprehensive. Here are a few additional steps you can take:
- **Enhance Prompt Diversity**: Ensure that the prompt used for question generation includes a variety of examples and instructions that encourage diverse question generation. You can use different types of prompts available in Ragas, such as `conditional_question_prompt`, `compress_question_prompt`, and `reasoning_question_prompt` [2].
- **Use Evolutionary Techniques**: Utilize the `TestsetGenerator` with different types of question generation techniques like `simple`, `reasoning`, and `multi_context` to ensure a diverse set of questions [3].
- **Check Embedding Consistency**: Ensure that the embeddings used for calculating similarity are consistent and correctly implemented. This can affect the diversity of the generated questions.
Here is an example of how to use the TestsetGenerator to generate diverse questions:
```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load your documents
# documents = ...

# Initialize generator with OpenAI models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings,
)

# Define question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1,
}

# Generate test set with diverse questions
testset = generator.generate_with_langchain_docs(documents, 10, distributions)
testset.to_pandas()
```
By ensuring prompt diversity and using evolutionary techniques, you can generate a more varied set of questions, leading to more consistent and reliable evaluations.
@dosu Sorry, but it doesn't help, except that in some cases it increases the number of questions even more.
To address the issue of generating the same set of questions despite increasing the strictness parameter, you can try the following steps:
- **Enhance Prompt Instructions**: Ensure that the prompt instructions explicitly encourage diversity in the generated questions. You can modify the `QUESTION_GEN` prompt to include instructions for generating varied questions.
- **Use Different Prompts**: Utilize different types of prompts available in Ragas, such as `conditional_question_prompt`, `compress_question_prompt`, and `reasoning_question_prompt`, to generate a diverse set of questions.
- **Random Seed for Generation**: Introduce randomness in the question generation process by setting a random seed for the LLM. This can help in generating different questions each time.
- **Check Embedding Consistency**: Ensure that the embeddings used for calculating similarity are consistent and correctly implemented. This can affect the diversity of the generated questions.
Here is an example of how to modify the QUESTION_GEN prompt and introduce randomness:
```python
from ragas.llms.prompt import Prompt
import random

QUESTION_GEN = Prompt(
    name="question_generation",
    instruction="""Generate a diverse set of questions for the given answer and identify if the answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers. Ensure the questions are varied and cover different aspects of the context.""",
    examples=[
        {
            "answer": "Albert Einstein was born in Germany.",
            "context": "Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time.",
            "output": {"question": "Where was Albert Einstein born?", "noncommittal": 0},
        },
        {
            "answer": "It can change its skin color based on the temperature of its environment.",
            "context": "A recent scientific study has discovered a new species of frog in the Amazon rainforest that has the unique ability to change its skin color based on the temperature of its environment.",
            "output": {"question": "What unique ability does the newly discovered species of frog have?", "noncommittal": 0},
        },
        {
            "answer": "Everest",
            "context": "The tallest mountain on Earth, measured from sea level, is a renowned peak located in the Himalayas.",
            "output": {"question": "What is the tallest mountain on Earth?", "noncommittal": 0},
        },
        {
            "answer": "I don't know about the groundbreaking feature of the smartphone invented in 2023 as I am unaware of information beyond 2022.",
            "context": "In 2023, a groundbreaking invention was announced: a smartphone with a battery life of one month, revolutionizing the way people use mobile technology.",
            "output": {"question": "What was the groundbreaking feature of the smartphone invented in 2023?", "noncommittal": 1},
        },
    ],
    input_keys=["answer", "context"],
    output_key="output",
    output_type="json",
)


@dataclass
class AnswerRelevancy(MetricWithLLM, MetricWithEmbeddings):
    name: str = "answer_relevancy"
    evaluation_mode: EvaluationMode = EvaluationMode.qac
    question_generation: Prompt = field(default_factory=lambda: QUESTION_GEN)
    strictness: int = 3  # Adjust this value as needed

    def calculate_similarity(self, question: str, generated_questions: list[str]):
        assert self.embeddings is not None
        question_vec = np.asarray(self.embeddings.embed_query(question)).reshape(1, -1)
        gen_question_vec = np.asarray(self.embeddings.embed_documents(generated_questions)).reshape(len(generated_questions), -1)
        norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(question_vec, axis=1)
        return np.dot(gen_question_vec, question_vec.T).reshape(-1,) / norm

    def _calculate_score(self, answers: t.Sequence[AnswerRelevanceClassification], row: t.Dict) -> float:
        question = row["question"]
        gen_questions = [answer.question for answer in answers]
        committal = np.any([answer.noncommittal for answer in answers])
        if all(q == "" for q in gen_questions):
            logger.warning("Invalid JSON response. Expected dictionary with key 'question'")
            score = np.nan
        else:
            cosine_sim = self.calculate_similarity(question, gen_questions)
            score = cosine_sim.mean() * int(not committal)
        return score

    def _create_question_gen_prompt(self, row: t.Dict) -> PromptValue:
        ans, ctx = row["answer"], row["contexts"]
        return self.question_generation.format(answer=ans, context="\n".join(ctx))

    async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
        assert self.llm is not None, "LLM is not set"
        random.seed()  # Introduce randomness
        prompt = self._create_question_gen_prompt(row)
        result = await self.llm.generate(prompt, n=self.strictness, callbacks=callbacks)
        answers = [await _output_parser.aparse(result.text, prompt, self.llm) for result in result.generations[0]]
        if any(answer is None for answer in answers):
            return np.nan
        answers = [answer for answer in answers if answer is not None]
        return self._calculate_score(answers, row)
```
By enhancing the prompt instructions, using different types of prompts, and introducing randomness, you should be able to generate a more diverse set of questions [1][2][3].
Probably @jjmachan or @shahules786 can help with this? I'm also facing the same issue.
I am using the model claude-3-5-sonnet-20241022; it always returns the same question for the same input.
I propose that if we set strictness > 1, we should pass the previously generated questions so the LLM is told not to generate the same question again. I know this is tricky.
Or just tell the LLM to generate 3 different questions in a single call; this not only makes the LLM produce different outputs but also reduces cost because we don't need to call it multiple times.
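For illustration, a rough sketch of what that single-call output schema could look like (a hypothetical Pydantic model, not the current Ragas output parser):

```python
from typing import List

from pydantic import BaseModel, Field

# hypothetical schema: request all questions in one generation instead of n separate samples
class MultiQuestionOutput(BaseModel):
    questions: List[str] = Field(description="3 distinct questions that the given answer would address")
    noncommittal: int = Field(description="1 if the answer is noncommittal, else 0")
```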
I'm also facing the same issue (getting the same question repeated instead of different questions). @jjmachan @shahules786 any suggestions on how to solve this?
I agree with @yusufsyaifudin: why not ask the LLM for 3 questions in a single call? (I am currently facing the same issue where the judge LLM always provides the same question each time.)
@spineau59 By the way, I currently mitigate this by setting the temperature to 1 and setting `bypass_temperature=True`:
```python
llm: BaseChatModel = None  # use some LangChain model here
evaluator_llm: BaseRagasLLM = LangchainLLMWrapper(langchain_llm=llm, bypass_temperature=True)
```
This works because, when I debugged the calls in Ragas and LangChain, I saw that the temperature is set to 0.3 when n > 1, which causes the same input to generate the same output. References:
- https://github.com/explodinggradients/ragas/blob/v0.3.3/src/ragas/llms/base.py#L71-L73
- https://github.com/explodinggradients/ragas/blob/v0.3.3/src/ragas/prompt/pydantic_prompt.py#L220-L226
- https://github.com/explodinggradients/ragas/blob/v0.3.3/src/ragas/metrics/_answer_relevance.py#L142-L144
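For context, the behavior at the first link is roughly the following (my paraphrase of v0.3.3; it may differ in other versions):

```python
# paraphrase of get_temperature in ragas/llms/base.py (v0.3.3): a fixed, fairly low
# temperature is chosen whenever n > 1, so repeated samples can come out identical
def get_temperature(self, n: int) -> float:
    """Return the temperature to use for completion based on n."""
    return 0.3 if n > 1 else 1e-8
```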
I don't know if iterating up to N is by design to guard against LLM hallucination, i.e. the first call may return an outlier "statement A", while the others return a group of similar statements that are really different from statement A:
- statement A -> about brand logo
- statement B -> fruit name
- statement C -> genus name of plant