[Question]: AnswerRelevancyEvaluator() and ContextRelevancyEvaluator() result and document are unclear.

Open qiongw opened this issue 1 year ago • 4 comments

Question Validation

  • [X] I have searched both the documentation and discord for an answer.

Question

I have some questions on AnswerRelevancyEvaluator() and ContextRelevancyEvaluator().

The following link provides some description of AnswerRelevancyEvaluator() and ContextRelevancyEvaluator(): https://docs.llamaindex.ai/en/stable/api_reference/evaluation.html

AnswerRelevancyEvaluator(): Answer relevancy evaluator. Evaluates the relevancy of response to a query. This evaluator considers the query string and response string.

ContextRelevancyEvaluator() Context relevancy evaluator. Evaluates the relevancy of retrieved contexts to a query. This evaluator considers the query string and retrieved contexts.

For both evaluators, the documentation does not mention the output score range.

Based on the example document below, it seems the score from ContextRelevancyEvaluator() is between 0 and 1, while the score from AnswerRelevancyEvaluator() is 0, 0.5, or 1: https://docs.llamaindex.ai/en/stable/examples/evaluation/answer_and_context_relevancy.html#evaluating-answer-and-context-relevancy-separately Can anyone elaborate?
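
For reference, a minimal usage sketch of both evaluators against llama_index 0.9.x (the ServiceContext setup, model choice, and example strings are illustrative, not taken from the docs above; newer releases take an `llm=` argument instead of `service_context=`):

```python
from llama_index import ServiceContext
from llama_index.llms import OpenAI
from llama_index.evaluation import (
    AnswerRelevancyEvaluator,
    ContextRelevancyEvaluator,
)

# Illustrative judge setup; any LLM supported by ServiceContext should work.
service_context = ServiceContext.from_defaults(llm=OpenAI(model="gpt-3.5-turbo"))

answer_eval = AnswerRelevancyEvaluator(service_context=service_context)
context_eval = ContextRelevancyEvaluator(service_context=service_context)

query = "What did the author do growing up?"
response = "Growing up, the author worked on writing and programming outside of school."
contexts = ["Before college the two main things I worked on were writing and programming."]

# AnswerRelevancy judges (query, response); ContextRelevancy judges (query, contexts).
answer_result = answer_eval.evaluate(query=query, response=response)
context_result = context_eval.evaluate(query=query, contexts=contexts)

print(answer_result.score, answer_result.feedback)
print(context_result.score, context_result.feedback)
```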

I also ran into issues with AnswerRelevancyEvaluator() and ContextRelevancyEvaluator() when using llama_index==0.9.41. See one example below using ContextRelevancyEvaluator():

Feedback: "1. The retrieved context partially matches the subject matter of the user's query. The context talks about what the author did growing up, including writing and programming. However, it does not provide a comprehensive answer to the query as it only covers a few aspects of the author's childhood activities. (1 point)*n\n2. The retrieved context cannot be used exclusively to provide a full answer to the user's query. While it provides some information about the author's childhood activities, it does not provide a complete picture of what the author did growing up. (0 points)\n\nOverall, the retrieved context is partially relevant to the user's query but does not provide a complete answer. \n\n[RESULT] 1.0*"

Score: 0.25. The feedback mentions some points, but it is unclear what they mean, and it is also unclear how the final score is calculated. Is the score calculated based on the points in the feedback?

See one example below when using AnswerRelevancyEvaluator() EvaluationResult(query='What did the author do growing up?', contexts=None, response='Growing up, the author worked on writing and programming outside of school. They wrote short stories and tried writing programs on an IBM 1401 in 9th grade using an early version of Fortran. With microcomputers, the author started programming more and wrote simple games, a program to predict how high model rockets would fly, and a word processor that their father used to write at least one book.', passing=None, feedback="1. Yes, the provided response matches the subject matter of the user's query by discussing what the author did growing up.\n2. Yes, the provided response attempts to address the focus or perspective on the subject matter taken on by the user's query by discussing the author's activities related to writing and programming.\n\n[VALID] followed by 2", score=None, pairwise_source=None, invalid_result=True, invalid_reason='Unable to parse the output string.')

At the end of the feedback it mentions "followed by 2"; what does that mean? And why is the score None?

qiongw avatar Feb 02 '24 12:02 qiongw

🤖

Thank you for your detailed questions. I'll try to clarify the points you've raised.

The AnswerRelevancyEvaluator() and ContextRelevancyEvaluator() functions in LlamaIndex are designed to evaluate the relevancy of responses and contexts to a given query. They return an EvaluationResult object, which includes a score attribute. This score is a float value that represents the relevancy of the response or context to the query.

The exact range of the output score for these functions is between 0.0 and 1.0. The score is calculated based on whether the response or context is relevant to the query. If it is relevant (i.e., the word "yes" is in the raw response text), the score is 1.0. Otherwise, the score is 0.0.

Regarding the points in the feedback, they are part of the raw response text from the evaluation. The meaning of these points would depend on the context of the evaluation. For example, in the feedback you provided, the points seem to be used to explain why the context is partially relevant to the query.

As for the phrase "followed by 2" in the feedback, it is also part of the raw response text. Its meaning would depend on the context of the evaluation. Without more information, it's hard to say exactly what it means in this case.

The invalid_result=True and invalid_reason='Unable to parse the output string.' in your AnswerRelevancyEvaluator() example indicate that there was an error in parsing the output string. This could be due to a variety of reasons, such as an unexpected format in the output string.
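
As a rough sketch (a hypothetical helper, assuming only the EvaluationResult fields visible in your example: score, invalid_result, invalid_reason, feedback), you can check for this case before using the score:

```python
def safe_score(result, default=0.0):
    """Fall back gracefully when the judge output could not be parsed."""
    if result.invalid_result:
        # result.score is None here; inspect the raw judge text instead.
        print("Could not parse judge output:", result.invalid_reason)
        print("Raw feedback:", result.feedback)
        return default
    return result.score
```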

I hope this helps clarify how the AnswerRelevancyEvaluator() and ContextRelevancyEvaluator() functions work in LlamaIndex. If you have any more questions, please don't hesitate to ask.

For more information, you can refer to the source code of the RelevancyEvaluator class in the LlamaIndex codebase: llama_index/evaluation/relevancy.py.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

dosubot[bot] avatar Feb 02 '24 12:02 dosubot[bot]

Hi @qiongw, both AnswerRelevancyEvaluator and ContextRelevancyEvaluator provide a "rubric" grading scheme to the LLM evaluator to pass judgement and assign scores. Ultimately, provided the LLM evaluator adheres to the required output format, the final score will be a value between 0 and 1 (the fraction of points awarded by the judge out of the total possible points).

Note, the rubrics for AnswerRelevancyEvaluator and ContextRelevancyEvaluator are different and indeed have differing totals of points that can be awarded: AnswerRelevancy -> 2 total possible points; ContextRelevancy -> 4 total possible points.

The "rubrics" (i.e. DEFAULT_EVAL_TEMPLATE) can be viewed here:

Again, the final score is then computed by the formula: ( the total number of points awarded by the LLM ) / ( total possible points )
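
As a worked example of that formula (illustrative only; the evaluator's parser normally extracts the awarded points from the "[RESULT] ..." line for you):

```python
# Rubric totals per evaluator, per the comment above.
ANSWER_RELEVANCY_TOTAL = 2.0
CONTEXT_RELEVANCY_TOTAL = 4.0

def rubric_score(points_awarded: float, total_points: float) -> float:
    """Final score = fraction of rubric points awarded, so it lands in [0, 1]."""
    return points_awarded / total_points

# The ContextRelevancy feedback above ended with "[RESULT] 1.0", so:
print(rubric_score(1.0, CONTEXT_RELEVANCY_TOTAL))  # 0.25 -- matches the reported Score
```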

As for your second example, I don't know why the LLM would respond with "[VALID] followed by 2". Because of that, our output parser fails to extract the total points awarded and cannot compute the final score, which is why you see invalid_reason="Unable to parse the output string."

May I know what LLM you are using to evaluate?

nerdai avatar Feb 02 '24 17:02 nerdai

@nerdai Thank you for your explanations. It is clear now and I will go through the links you provided. I am using "gpt-3.5-turbo" from Azure.

Maybe the result can be shown as: "1. The retrieved context partially matches the subject matter of the user's query. The context talks about what the author did growing up, including writing and programming. However, it does not provide a comprehensive answer to the query as it only covers a few aspects of the author's childhood activities. (1/2 point)\n\n2. The retrieved context cannot be used exclusively to provide a full answer to the user's query. While it provides some information about the author's childhood activities, it does not provide a complete picture of what the author did growing up. (0/2 points)\n\nOverall, the retrieved context is partially relevant to the user's query but does not provide a complete answer. \n\n[RESULT] 1.0/4.0"

Btw, I was also checking DeepEval following the links below: https://docs.llamaindex.ai/en/stable/examples/evaluation/Deepeval.html https://docs.llamaindex.ai/en/stable/examples/llm/azure_openai.html

But I got an AuthenticationError (Error code: 401); do you know why?
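
For reference, the Azure side of the LlamaIndex setup looks roughly like this (a sketch modelled on the linked azure_openai example; the endpoint, key variable, API version, and deployment name below are placeholders). A 401 usually means the key or endpoint is not reaching the client; note also that DeepEval's own metrics may need to be pointed at Azure separately rather than reusing the LlamaIndex LLM:

```python
import os

from llama_index.llms import AzureOpenAI

# Placeholders -- replace with your Azure OpenAI resource details.
llm = AzureOpenAI(
    model="gpt-35-turbo",
    deployment_name="<your-deployment-name>",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_version="2023-07-01-preview",
)

# Quick credential check: a 401 here points at the key/endpoint, not at DeepEval.
print(llm.complete("ping"))
```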

qiongw avatar Feb 03 '24 08:02 qiongw

Hi, @qiongw,

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. From what I understand, the issue you raised was regarding the output score range for AnswerRelevancyEvaluator() and ContextRelevancyEvaluator() in llama_index==0.9.41, and there was confusion around the lack of documentation specifying the score range and how the final score is calculated based on the feedback. There were clarifications provided about the score range and the "rubric" grading scheme, and you also shared an example of the result, as well as an issue with an AuthenticationError when checking DeepEval.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or the issue will be automatically closed in 7 days.

Thank you!

dosubot[bot] avatar May 04 '24 16:05 dosubot[bot]