Problem with `answer_relevancy` Metric in Ragas Framework

Open huangxuyh opened this issue 1 year ago • 6 comments

Description:

Hello,

I am experiencing an issue with the Ragas evaluation framework when using the answer_relevancy metric. Other metrics work fine, but the answer_relevancy metric causes an error. Below is a snippet of the code I am using and the corresponding error message.

Code:

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_similarity,
    answer_correctness,
)

result = evaluate(
    dataset=dataset,
    llm=llm,
    embeddings=embeddings,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_recall,
        context_precision,
        answer_similarity,
        answer_correctness,
    ],
)
result

Error Message:

BadRequestError: Error code: 400 - {'object': 'error', 'message': 'best_of must be 1 when using greedy sampling.Got 3.', 'type': 'BadRequestError', 'param': None, 'code': 400}

The error is a BadRequestError whose message says that best_of must be 1 when using greedy sampling, even though I am not explicitly setting the best_of parameter anywhere in my code.

Any guidance on how to resolve this issue would be greatly appreciated.

Thank you!

huangxuyh avatar Aug 03 '24 09:08 huangxuyh

Hello, @huangxuyh! I'm here to help you with any bugs, questions, or contributions you have.

To resolve the BadRequestError related to the best_of parameter when using the answer_relevancy metric in the Ragas evaluation framework, ensure that best_of is not passed when the LLM's generate method is called: the generate call inside the _ascore function should not reference the best_of parameter at all.

Here is the relevant part of the code with the necessary modification:

async def _ascore(self, row: t.Dict, callbacks: Callbacks, is_async: bool) -> float:
    assert self.llm is not None, "LLM is not set"

    prompt = self._create_question_gen_prompt(row)
    result = await self.llm.generate(
        prompt,
        n=self.strictness,
        callbacks=callbacks,
        is_async=is_async,
    )

    answers = [
        await _output_parser.aparse(result.text, prompt, self.llm)
        for result in result.generations[0]
    ]
    if any(answer is None for answer in answers):
        return np.nan

    answers = [answer for answer in answers if answer is not None]
    return self._calculate_score(answers, row)

Ensure that the generate method does not include the best_of parameter, as shown above [1].
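
Since the snippet above passes n=self.strictness to generate, another workaround (a sketch only, assuming the metric instance exposes strictness as a settable attribute) is to lower the strictness so that only a single completion is requested per sample:

from ragas.metrics import answer_relevancy

# Hypothetical workaround: request only one generated question per sample,
# so the server never derives best_of > 1 from n. The relevancy score is then
# averaged over fewer generated questions, so the signal is weaker.
answer_relevancy.strictness = 1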

To continue talking to Dosu, mention @dosu.

dosubot[bot] avatar Aug 03 '24 09:08 dosubot[bot]

@huangxuyh which model are you using?

jjmachan avatar Aug 05 '24 14:08 jjmachan

@huangxuyh which model are you using?

I am using the Qwen2 7B Instruct model, which I have wrapped with the vLLM framework to expose an OpenAI-compatible interface.

from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

inference_server_url = "http://localhost:8000/v1"
MODEL = "/root/.cache/modelscope/hub/qwen/Qwen2-1___5B-Instruct"
chat = ChatOpenAI(
    model=MODEL,
    openai_api_key="token-abc123",
    openai_api_base=inference_server_url,
    max_tokens=2048,
    temperature=0,
)
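
For reference, a minimal sketch of how this chat model would then be handed to ragas (assuming the LangchainLLMWrapper import above is meant to be used this way):

# Wrap the LangChain chat model so it can be passed to evaluate() as `llm`.
llm = LangchainLLMWrapper(chat)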

huangxuyh avatar Aug 05 '24 22:08 huangxuyh

Sadly, a 7B model won't be good enough for evaluation. The JSON-following capabilities of smaller models are currently tricky with these prompts.

jjmachan avatar Aug 06 '24 05:08 jjmachan

Hey I am also getting this same issue when trying to run answer_relevancy using Llama 3 8B Instruct. Any help would be greatly appreciated!!

joelreji avatar Sep 13 '24 14:09 joelreji

Hey I am also getting this same issue when trying to run answer_relevancy using Llama 3 8B Instruct. Any help would be greatly appreciated!!

I went back and looked at the vLLM docs and found that n is set to 3 by Ragas, which causes the best_of value to default to 3. If your use_beam_search is false and temperature is 0, you will also hit this scenario.
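
For anyone who wants to reproduce the check locally, here is a small sketch (assuming a vLLM version from around the time of this issue, where SamplingParams still validated best_of at construction):

from vllm import SamplingParams

# best_of defaults to n, so n=3 with temperature=0 (greedy sampling) trips
# vLLM's validation and raises:
#   ValueError: best_of must be 1 when using greedy sampling.Got 3.
params = SamplingParams(n=3, temperature=0)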

I used the extra_body argument on my ChatOpenAI object to set the relevant parameters and get past this error.
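
The issue doesn't say exactly which fields were set, but based on the use_beam_search observation above, one plausible sketch is to forward a vLLM-specific sampling flag through extra_body so that best_of > 1 is no longer rejected under greedy sampling:

from langchain_openai import ChatOpenAI

# Sketch only: extra_body forwards vLLM-specific fields in the request body.
# With use_beam_search enabled, vLLM accepts best_of > 1 at temperature 0.
# (Nudging temperature above 0 is another way to leave greedy sampling.)
chat = ChatOpenAI(
    model=MODEL,
    openai_api_key="token-abc123",
    openai_api_base=inference_server_url,
    max_tokens=2048,
    temperature=0,
    extra_body={"use_beam_search": True},
)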

joelreji avatar Sep 19 '24 18:09 joelreji