Answer Relevancy giving the same questions every time
@dosu I've gone through the library. First, the prompt needs something that instructs the model to go through the context, rather than just providing it in the few-shot examples. Second, with your help I streamed the generated questions, and all three questions are exactly the same, and that happens for every row. Maybe that is why the score is inconsistent. Please help me fix it.
To address the issues with the Answer Relevancy feature generating the same questions every time and causing inconsistent scoring, you can consider the following steps:
- **Ensure Randomness in Question Generation**: Modify the `QUESTION_GEN` prompt to include a variety of examples and instructions that encourage diverse question generation.
- **Check Embedding Consistency**: Ensure that the embeddings used for calculating similarity are consistent and correctly implemented.
- **Adjust Strictness Parameter**: The `strictness` parameter controls the number of questions generated per answer. Ensure it is set to an appropriate value (e.g., 3 by default).
Here is a modified version of the relevant parts of the code to address these issues:
```python
QUESTION_GEN = Prompt(
    name="question_generation",
    instruction="""Generate a diverse set of questions for the given answer and identify if the answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers. Ensure the questions are varied and cover different aspects of the context.""",
    output_format_instruction=_output_instructions,
    examples=[
        {
            "answer": """Albert Einstein was born in Germany.""",
            "context": """Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time""",
            "output": AnswerRelevanceClassification.parse_obj(
                {
                    "question": "Where was Albert Einstein born?",
                    "noncommittal": 0,
                }
            ).dict(),
        },
        {
            "answer": """It can change its skin color based on the temperature of its environment.""",
            "context": """A recent scientific study has discovered a new species of frog in the Amazon rainforest that has the unique ability to change its skin color based on the temperature of its environment.""",
            "output": AnswerRelevanceClassification.parse_obj(
                {
                    "question": "What unique ability does the newly discovered species of frog have?",
                    "noncommittal": 0,
                }
            ).dict(),
        },
        {
            "answer": """Everest""",
            "context": """The tallest mountain on Earth, measured from sea level, is a renowned peak located in the Himalayas.""",
            "output": AnswerRelevanceClassification.parse_obj(
                {
                    "question": "What is the tallest mountain on Earth?",
                    "noncommittal": 0,
                }
            ).dict(),
        },
        {
            "answer": """I don't know about the groundbreaking feature of the smartphone invented in 2023 as I am unaware of information beyond 2022.""",
            "context": """In 2023, a groundbreaking invention was announced: a smartphone with a battery life of one month, revolutionizing the way people use mobile technology.""",
            "output": AnswerRelevanceClassification.parse_obj(
                {
                    "question": "What was the groundbreaking feature of the smartphone invented in 2023?",
                    "noncommittal": 1,
                }
            ).dict(),
        },
    ],
    input_keys=["answer", "context"],
    output_key="output",
    output_type="json",
)


@dataclass
class AnswerRelevancy(MetricWithLLM, MetricWithEmbeddings):
    name: str = "answer_relevancy"
    evaluation_mode: EvaluationMode = EvaluationMode.qac
    question_generation: Prompt = field(default_factory=lambda: QUESTION_GEN)
    strictness: int = 3  # Adjust this value as needed

    def calculate_similarity(
        self: t.Self, question: str, generated_questions: list[str]
    ):
        assert self.embeddings is not None
        question_vec = np.asarray(self.embeddings.embed_query(question)).reshape(1, -1)
        gen_question_vec = np.asarray(
            self.embeddings.embed_documents(generated_questions)
        ).reshape(len(generated_questions), -1)
        norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(
            question_vec, axis=1
        )
        return (
            np.dot(gen_question_vec, question_vec.T).reshape(
                -1,
            )
            / norm
        )

    def _calculate_score(
        self, answers: t.Sequence[AnswerRelevanceClassification], row: t.Dict
    ) -> float:
        question = row["question"]
        gen_questions = [answer.question for answer in answers]
        committal = np.any([answer.noncommittal for answer in answers])
        if all(q == "" for q in gen_questions):
            logger.warning(
                "Invalid JSON response. Expected dictionary with key 'question'"
            )
            score = np.nan
        else:
            cosine_sim = self.calculate_similarity(question, gen_questions)
            score = cosine_sim.mean() * int(not committal)
        return score

    def _create_question_gen_prompt(self, row: t.Dict) -> PromptValue:
        ans, ctx = row["answer"], row["contexts"]
        return self.question_generation.format(answer=ans, context="\n".join(ctx))

    async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
        assert self.llm is not None, "LLM is not set"
        prompt = self._create_question_gen_prompt(row)
        result = await self.llm.generate(
            prompt,
            n=self.strictness,
            callbacks=callbacks,
        )
        answers = [
            await _output_parser.aparse(result.text, prompt, self.llm)
            for result in result.generations[0]
        ]
        if any(answer is None for answer in answers):
            return np.nan
        answers = [answer for answer in answers if answer is not None]
        return self._calculate_score(answers, row)
```
These changes should help generate more diverse questions and produce more consistent scores. For context, the Answer Relevancy metric is defined as the mean cosine similarity of the original question to a number of artificial questions generated from the answer: if the generated answer accurately addresses the initial question, the LLM should be able to reconstruct questions from the answer that align with the original question [1][2].
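As a toy illustration of that scoring rule (made-up vectors standing in for real embeddings, not the library's own code):

```python
import numpy as np

# hypothetical toy embeddings for the original question and three generated questions
question_vec = np.array([[0.9, 0.1, 0.0]])
gen_question_vecs = np.array([
    [0.8, 0.2, 0.1],
    [0.7, 0.3, 0.0],
    [0.9, 0.0, 0.1],
])

# cosine similarity of each generated question to the original question
norm = np.linalg.norm(gen_question_vecs, axis=1) * np.linalg.norm(question_vec, axis=1)
cosine_sim = np.dot(gen_question_vecs, question_vec.T).reshape(-1) / norm

noncommittal = False  # any noncommittal generation zeroes out the score
score = cosine_sim.mean() * int(not noncommittal)
print(round(float(score), 3))
```

Note that if any of the generated outputs is flagged noncommittal, the whole score collapses to 0, independent of the cosine similarities.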
@dosu I made the changes, but I still get the same set of questions. What's the point of strictness?
The `strictness` parameter in Ragas controls the number of questions generated per answer, typically set between 2 and 4. It helps maintain a certain level of self-consistency in predictions, ensuring that the evaluations are consistent and reliable [1].
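For example, assuming the `AnswerRelevancy` dataclass shown above is exposed via `ragas.metrics` (an assumption about the import path), raising `strictness` is just a constructor argument; note it only controls how many completions are sampled, not how different they are:

```python
from ragas.metrics import AnswerRelevancy

# sketch: sample 4 generated questions per answer instead of the default 3;
# strictness is passed as n to llm.generate, so it does not force distinct outputs
answer_relevancy = AnswerRelevancy(strictness=4)
```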
If you are still getting the same set of questions despite adjusting the strictness parameter, you might need to ensure that the prompts and examples used for question generation are diverse and comprehensive. Here are a few additional steps you can take:
- **Enhance Prompt Diversity**: Ensure that the prompt used for question generation includes a variety of examples and instructions that encourage diverse question generation. You can use different types of prompts available in Ragas, such as `conditional_question_prompt`, `compress_question_prompt`, and `reasoning_question_prompt` [2].
- **Use Evolutionary Techniques**: Utilize the `TestsetGenerator` with different types of question generation techniques like `simple`, `reasoning`, and `multi_context` to ensure a diverse set of questions [3].
- **Check Embedding Consistency**: Ensure that the embeddings used for calculating similarity are consistent and correctly implemented. This can affect the diversity of the generated questions.
Here is an example of how to use the TestsetGenerator to generate diverse questions:
```python
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load your documents
# documents = ...

# Initialize generator with OpenAI models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings,
)

# Define question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1,
}

# Generate test set with diverse questions
testset = generator.generate_with_langchain_docs(documents, 10, distributions)
testset.to_pandas()
```
By ensuring prompt diversity and using evolutionary techniques, you can generate a more varied set of questions, leading to more consistent and reliable evaluations.
@dosu Sorry, but it doesn't help, except that in some cases it increases the number of questions even more.
To address the issue of generating the same set of questions despite increasing the strictness parameter, you can try the following steps:
- **Enhance Prompt Instructions**: Ensure that the prompt instructions explicitly encourage diversity in the generated questions. You can modify the `QUESTION_GEN` prompt to include instructions for generating varied questions.
- **Use Different Prompts**: Utilize different types of prompts available in Ragas, such as `conditional_question_prompt`, `compress_question_prompt`, and `reasoning_question_prompt`, to generate a diverse set of questions.
- **Random Seed for Generation**: Introduce randomness in the question generation process by setting a random seed for the LLM. This can help in generating different questions each time.
- **Check Embedding Consistency**: Ensure that the embeddings used for calculating similarity are consistent and correctly implemented. This can affect the diversity of the generated questions.
Here is an example of how to modify the QUESTION_GEN prompt and introduce randomness:
```python
from ragas.llms.prompt import Prompt
import random

QUESTION_GEN = Prompt(
    name="question_generation",
    instruction="""Generate a diverse set of questions for the given answer and identify if the answer is noncommittal. Give noncommittal as 1 if the answer is noncommittal and 0 if the answer is committal. A noncommittal answer is one that is evasive, vague, or ambiguous. For example, "I don't know" or "I'm not sure" are noncommittal answers. Ensure the questions are varied and cover different aspects of the context.""",
    examples=[
        {
            "answer": "Albert Einstein was born in Germany.",
            "context": "Albert Einstein was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time.",
            "output": {"question": "Where was Albert Einstein born?", "noncommittal": 0},
        },
        {
            "answer": "It can change its skin color based on the temperature of its environment.",
            "context": "A recent scientific study has discovered a new species of frog in the Amazon rainforest that has the unique ability to change its skin color based on the temperature of its environment.",
            "output": {"question": "What unique ability does the newly discovered species of frog have?", "noncommittal": 0},
        },
        {
            "answer": "Everest",
            "context": "The tallest mountain on Earth, measured from sea level, is a renowned peak located in the Himalayas.",
            "output": {"question": "What is the tallest mountain on Earth?", "noncommittal": 0},
        },
        {
            "answer": "I don't know about the groundbreaking feature of the smartphone invented in 2023 as I am unaware of information beyond 2022.",
            "context": "In 2023, a groundbreaking invention was announced: a smartphone with a battery life of one month, revolutionizing the way people use mobile technology.",
            "output": {"question": "What was the groundbreaking feature of the smartphone invented in 2023?", "noncommittal": 1},
        },
    ],
    input_keys=["answer", "context"],
    output_key="output",
    output_type="json",
)


@dataclass
class AnswerRelevancy(MetricWithLLM, MetricWithEmbeddings):
    name: str = "answer_relevancy"
    evaluation_mode: EvaluationMode = EvaluationMode.qac
    question_generation: Prompt = field(default_factory=lambda: QUESTION_GEN)
    strictness: int = 3  # Adjust this value as needed

    def calculate_similarity(self, question: str, generated_questions: list[str]):
        assert self.embeddings is not None
        question_vec = np.asarray(self.embeddings.embed_query(question)).reshape(1, -1)
        gen_question_vec = np.asarray(self.embeddings.embed_documents(generated_questions)).reshape(len(generated_questions), -1)
        norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(question_vec, axis=1)
        return np.dot(gen_question_vec, question_vec.T).reshape(-1,) / norm

    def _calculate_score(self, answers: t.Sequence[AnswerRelevanceClassification], row: t.Dict) -> float:
        question = row["question"]
        gen_questions = [answer.question for answer in answers]
        committal = np.any([answer.noncommittal for answer in answers])
        if all(q == "" for q in gen_questions):
            logger.warning("Invalid JSON response. Expected dictionary with key 'question'")
            score = np.nan
        else:
            cosine_sim = self.calculate_similarity(question, gen_questions)
            score = cosine_sim.mean() * int(not committal)
        return score

    def _create_question_gen_prompt(self, row: t.Dict) -> PromptValue:
        ans, ctx = row["answer"], row["contexts"]
        return self.question_generation.format(answer=ans, context="\n".join(ctx))

    async def _ascore(self, row: t.Dict, callbacks: Callbacks) -> float:
        assert self.llm is not None, "LLM is not set"
        random.seed()  # Introduce randomness
        prompt = self._create_question_gen_prompt(row)
        result = await self.llm.generate(prompt, n=self.strictness, callbacks=callbacks)
        answers = [await _output_parser.aparse(result.text, prompt, self.llm) for result in result.generations[0]]
        if any(answer is None for answer in answers):
            return np.nan
        answers = [answer for answer in answers if answer is not None]
        return self._calculate_score(answers, row)
```
By enhancing the prompt instructions, using different types of prompts, and introducing randomness, you should be able to generate a more diverse set of questions [1][2][3].
Probably @jjmachan or @shahules786 can help with this? I'm also facing the same issue.
I am using the model claude-3-5-sonnet-20241022; it always returns the same question for the same input.
I propose that if we set strictness > 1, we should pass the previously generated questions so the LLM is told not to generate the same question again. I know this is tricky.
Or just tell the LLM to generate 3 different questions in a single call; this not only makes the LLM produce different outputs but also reduces cost because we don't need to call it multiple times.
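For illustration, a rough sketch of what that single-call output schema could look like (a hypothetical Pydantic model, not the current Ragas output parser):

```python
from typing import List

from pydantic import BaseModel, Field

# hypothetical schema: request all questions in one generation instead of n separate samples
class MultiQuestionOutput(BaseModel):
    questions: List[str] = Field(description="3 distinct questions that the given answer would address")
    noncommittal: int = Field(description="1 if the answer is noncommittal, else 0")
```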
I'm also facing the same issue (getting the same question repeated instead of different questions). @jjmachan @shahules786 any suggestions on how to solve this?
I agree with @yusufsyaifudin: why not ask the LLM for 3 questions in a single call? (I am currently facing the same issue where the judge LLM always provides the same question each time.)
@spineau59 By the way, I currently mitigate this by setting the temperature to 1 and setting `bypass_temperature=True`:
```python
llm: BaseChatModel = None  # use some LangChain model here
evaluator_llm: BaseRagasLLM = LangchainLLMWrapper(langchain_llm=llm, bypass_temperature=True)
```
This works because, when I debugged the calls in Ragas and LangChain, I saw that the temperature is set to 0.3 when n > 1, which causes the same input to generate the same output. References:
- https://github.com/explodinggradients/ragas/blob/v0.3.3/src/ragas/llms/base.py#L71-L73
- https://github.com/explodinggradients/ragas/blob/v0.3.3/src/ragas/prompt/pydantic_prompt.py#L220-L226
- https://github.com/explodinggradients/ragas/blob/v0.3.3/src/ragas/metrics/_answer_relevance.py#L142-L144
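For context, the behavior at the first link is roughly the following (my paraphrase of v0.3.3; it may differ in other versions):

```python
# paraphrase of get_temperature in ragas/llms/base.py (v0.3.3): a fixed, fairly low
# temperature is chosen whenever n > 1, so repeated samples can come out identical
def get_temperature(self, n: int) -> float:
    """Return the temperature to use for completion based on n."""
    return 0.3 if n > 1 else 1e-8
```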
I don't know if iterating up to N is by design to guard against LLM hallucination, i.e. the first call may return an outlier "statement A", while the others return a group of similar statements that are really different from statement A:
- statement A -> about brand logo
- statement B -> fruit name
- statement C -> genus name of plant