
Context Recall returning NaN when using GPT-4 models

Open FranciscoAlves00 opened this issue 1 year ago • 3 comments

[x] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug
When using any gpt-4 model as an evaluator, the context recall metric returns a NaN result and the following warning for almost every question: WARNING:ragas.metrics._context_recall:Invalid JSON response. Expected dictionary with key 'Attributed'

I have tried this with my own dataset, as well as by following the instructions in https://docs.ragas.io/en/stable/getstarted/evaluation.html and changing only the evaluator to one of the GPT-4 models (gpt-4-0125-preview, gpt-4-1106-preview, and gpt-4). From the 10 questions in the test set, I got on average 9 NaN results for that metric. The other metrics work correctly.

Ragas version: 0.1.5
Python version: 3.10

Code to Reproduce
Follow the code in https://docs.ragas.io/en/stable/getstarted/evaluation.html, changing only the evaluator to one of the GPT-4 models (gpt-4-0125-preview, gpt-4-1106-preview, or gpt-4).

Error trace
WARNING:ragas.metrics._context_recall:Invalid JSON response. Expected dictionary with key 'Attributed'
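
For context on the warning: the metric apparently expects the evaluator model to return a JSON object containing an 'Attributed' key, and anything else ends up scored as NaN. Below is a minimal sketch of that kind of parsing, purely illustrative and not Ragas's actual implementation (the function name and fallback behavior are my own):

```python
import json
import math

def parse_attributed(raw_response: str) -> float:
    """Illustrative parser: expect a JSON object with an 'Attributed' key,
    fall back to NaN when the model output is not valid JSON."""
    try:
        data = json.loads(raw_response)
        return float(data["Attributed"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Anything that is not {"Attributed": ...} becomes NaN,
        # which is what then shows up in the metric score.
        return math.nan

# GPT-4 sometimes wraps its JSON in prose or markdown fences, which breaks json.loads:
print(parse_attributed('{"Attributed": 1}'))                # 1.0
print(parse_attributed('```json\n{"Attributed": 1}\n```'))  # nan
```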

FranciscoAlves00 avatar Mar 23 '24 18:03 FranciscoAlves00

Hey, can you please share some data points that I can use to reproduce the issue? I'll raise a fix; this is mostly an issue related to the JSON formatting, which we are working on.

shahules786 avatar Mar 23 '24 18:03 shahules786

```python
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    answer_correctness,
    context_relevancy,
)
from ragas import evaluate
from langchain.chat_models import ChatOpenAI
from datasets import load_dataset

amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
amnesty_qa

gpt4 = ChatOpenAI(model_name="gpt-4-0125-preview")
# gpt4 = ChatOpenAI(model_name="gpt-4")

result = evaluate(
    # experiment_dataset,
    amnesty_qa["eval"],
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        context_relevancy,
        answer_correctness,
    ],
    llm=gpt4,
)

result
df = result.to_pandas()

df.head(10)
```

Running this code from your documentation, I am getting 9/10 NaN values for context recall: recall_error.json
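
For reference, a quick way to count the NaN scores per metric from the result of the snippet above (a minimal sketch; it assumes each metric appears in the dataframe as a column named after the metric, e.g. context_recall):

```python
df = result.to_pandas()

# Count NaN scores per metric column (column names assumed to match the metric names)
metric_cols = [
    "context_precision",
    "faithfulness",
    "answer_relevancy",
    "context_recall",
    "context_relevancy",
    "answer_correctness",
]
for col in metric_cols:
    print(f"{col}: {df[col].isna().sum()}/{len(df)} NaN")
```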

FranciscoAlves00 avatar Mar 23 '24 19:03 FranciscoAlves00

I would like to add that it works better with the plain gpt-4 model and almost perfectly with the gpt-3.5 models, but I need to run the evaluation with the GPT-4 models. Moreover, I have tried installing previous Ragas versions and the same problem persists, which is very odd, since yesterday I was able to run the evaluations correctly.

FranciscoAlves00 avatar Mar 23 '24 19:03 FranciscoAlves00

Hi, I think this is still relevant. Context precision and context recall return NaN for GPT-4o models.

abhinavkashyapcrayon avatar Oct 04 '24 11:10 abhinavkashyapcrayon

As of today, the issue still exists. Context recall returns NaN while the other metrics are fine.

timelesshc avatar Nov 22 '24 05:11 timelesshc

Still exists; not sure why.

ArindamRoy23 avatar Mar 31 '25 10:03 ArindamRoy23