Problem with `answer_relevancy` Metric in Ragas Framework
Description:
Hello,
I am experiencing an issue with the Ragas evaluation framework when using the answer_relevancy metric. Other metrics work fine, but the answer_relevancy metric causes an error. Below is a snippet of the code I am using and the corresponding error message.
Code:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_similarity,
    answer_correctness,
)

result = evaluate(
    dataset=dataset,
    llm=llm,
    embeddings=embeddings,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
        answer_similarity,
        answer_correctness,
    ],
)
result
Error Message:
BadRequestError: Error code: 400 - {'object': 'error', 'message': 'best_of must be 1 when using greedy sampling.Got 3.', 'type': 'BadRequestError', 'param': None, 'code': 400}
It seems that the error is related to a BadRequestError with a message indicating that best_of must be 1 when using greedy sampling. I have verified that I am not explicitly setting the best_of parameter.
Any guidance on how to resolve this issue would be greatly appreciated.
Thank you!
Hello, @huangxuyh! I'm here to help you with any bugs, questions, or contributions you have.
To resolve the BadRequestError related to the best_of parameter when using the answer_relevancy metric, make sure best_of is not passed when calling the LLM's generate method; the generate call in the _ascore function should not reference best_of.
Here is the relevant part of the code with the necessary modification:
async def _ascore(self, row: t.Dict, callbacks: Callbacks, is_async: bool) -> float:
    assert self.llm is not None, "LLM is not set"

    prompt = self._create_question_gen_prompt(row)
    # Generate `strictness` candidate questions; best_of is never passed here.
    result = await self.llm.generate(
        prompt,
        n=self.strictness,
        callbacks=callbacks,
        is_async=is_async,
    )

    answers = [
        await _output_parser.aparse(result.text, prompt, self.llm)
        for result in result.generations[0]
    ]
    if any(answer is None for answer in answers):
        return np.nan

    answers = [answer for answer in answers if answer is not None]
    return self._calculate_score(answers, row)
Ensure that the generate method does not include the best_of parameter, as shown above [1].
@huangxuyh which model are you using?
@huangxuyh which model are you using? I am using the Qwen2 7B Instruct model, and I have wrapped it with the vLLM framework to expose an OpenAI-compatible interface.
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

inference_server_url = "http://localhost:8000/v1"
MODEL = "/root/.cache/modelscope/hub/qwen/Qwen2-1___5B-Instruct"

chat = ChatOpenAI(
    model=MODEL,
    openai_api_key="token-abc123",
    openai_api_base=inference_server_url,
    max_tokens=2048,
    temperature=0,
)
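The LangchainLLMWrapper import above is presumably how this client is handed to Ragas, but the thread doesn't show that step. The following is only a sketch of how the pieces could fit together; dataset and embeddings are assumed to be defined as in the first snippet, and newer Ragas versions can also wrap a raw LangChain chat model internally.

from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import answer_relevancy

# Sketch only: wrap the vLLM-backed ChatOpenAI client so Ragas can drive it.
# `dataset` and `embeddings` are assumed to exist as in the first snippet.
vllm_llm = LangchainLLMWrapper(chat)

result = evaluate(
    dataset=dataset,
    llm=vllm_llm,
    embeddings=embeddings,
    metrics=[answer_relevancy],
)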
Sadly, a 7B model won't be good enough for evaluation. The JSON-following capabilities of smaller models with the current prompts are still tricky.
Hey I am also getting this same issue when trying to run answer_relevancy using Llama 3 8B Instruct. Any help would be greatly appreciated!!
Hey I am also getting this same issue when trying to run answer_relevancy using Llama 3 8B Instruct. Any help would be greatly appreciated!!
I went back to the vLLM docs and found that Ragas sets n to 3, which causes vLLM to set best_of to 3 as well. If use_beam_search is false and temperature is 0 (greedy sampling), you will hit this error.
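Since the n passed to generate comes from the metric's strictness attribute (n=self.strictness in the _ascore snippet above), one way to sidestep this from the Ragas side is to lower strictness so vLLM receives n=1. This is a sketch of a workaround implied by that code, not something confirmed in this thread, and it does reduce the number of candidate questions the metric averages over.

from ragas.metrics import answer_relevancy

# Assumption based on the _ascore snippet above: n = self.strictness, so with
# strictness=1 vLLM should receive n=1 (and therefore best_of=1), which greedy
# sampling allows. Fewer candidate questions are generated as a result.
answer_relevancy.strictness = 1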
I used the extra_body argument on my ChatOpenAI object to override those sampling parameters and get past this error.
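The comment doesn't show the exact extra_body values, so the snippet below is only an illustrative guess. One combination that should satisfy vLLM's check is enabling beam search, which (in vLLM versions that still expose use_beam_search) permits best_of > 1 at temperature 0; alternatively, a small non-zero temperature avoids greedy sampling altogether. The field names are assumptions about vLLM's OpenAI-compatible extras, not a confirmed fix.

from langchain_openai import ChatOpenAI

# Sketch only: pass vLLM-specific sampling extras through the request body.
# `MODEL` and `inference_server_url` are the same values as in the snippet above.
chat = ChatOpenAI(
    model=MODEL,
    openai_api_key="token-abc123",
    openai_api_base=inference_server_url,
    max_tokens=2048,
    temperature=0,
    extra_body={
        # With use_beam_search=True, vLLM allows best_of > 1 at temperature 0
        # (older vLLM releases only; the flag was removed in later versions).
        "use_beam_search": True,
    },
)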