Possible bug in evaluate function
Describe the bug
I have been using the `evaluate` function in ragas to obtain different metrics of a RAG system. The evaluation consists of 51 questions with ground truth.
I use evaluate as:

```python
metrics = [faithfulness, context_precision, context_relevancy, answer_relevancy, answer_correctness]

evaluation_df = evaluate(dataset=dataset, metrics=metrics, llm=vertex_llm, embeddings=vertex_embeddings).to_pandas()
```
I did a study of stability, meaning: launching the same questions at the evaluation system and checking whether the scores vary. We all know that LLMs do not always return the same output, both for the RAG system and for the evaluation. I launched each question 11 times.
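For reference, here is a minimal sketch of how such a stability run could be set up. It assumes the same `dataset`, `metrics`, `vertex_llm` and `vertex_embeddings` objects as above, and that the result frame keeps a `question` column; `n_runs` and the `run` column are just illustrative names, not the exact script I used.

```python
import pandas as pd

# Repeat the same evaluation several times and look at the
# per-question spread of the scores.
n_runs = 11
runs = []
for i in range(n_runs):
    run_df = evaluate(
        dataset=dataset,
        metrics=metrics,
        llm=vertex_llm,
        embeddings=vertex_embeddings,
    ).to_pandas()
    run_df["run"] = i
    runs.append(run_df)

all_runs = pd.concat(runs, ignore_index=True)

# Mean and std of answer_correctness per question across the repeated runs.
stability = all_runs.groupby("question")["answer_correctness"].agg(["mean", "std"])
```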
For the metric answer_correctness with gemini-pro I obtained:
Each color represents a question, each dot a score and each error bar the std.
I was comparing with other LLMs to check whether the LLM was the culprit when I realised that if I use the evaluate function with a single ragas metric, this non-replicability effect is greatly reduced.
I changed the evaluation method to:
```python
df = None
for metric in metrics:
    evaluation_df = evaluate_ragas(
        dataset=dataset, metrics=[metric], llm=vertex_llm, embeddings=vertex_embeddings
    ).to_pandas()
    if df is None:
        df = evaluation_df
    else:
        df[metric.name] = evaluation_df[metric.name]
```
That is, running one metric at a time. It is much slower, but the results improved:
For a clearer comparison I have sorted the questions by std (see the short sketch after the figures), obtaining:
First case:
Second case:
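The sorting step is just a one-liner on the `stability` frame from the sketch above; the column name is the one assumed there.

```python
# Questions ordered by the std of their answer_correctness scores.
sorted_stability = stability.sort_values("std")
```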
I also made a bar graph of the metrics: bar height is the average score and the error bar is the average std (a sketch of this computation follows the figures).
First case:
Second case:
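The bar-chart numbers could be computed along these lines, reusing the combined `all_runs` frame from the first sketch and assuming each metric's `.name` matches its column in the result frame (as in the per-metric loop above):

```python
metric_cols = [m.name for m in metrics]

# Bar height: average score per metric over all questions and runs.
bar_height = all_runs[metric_cols].mean()

# Error bar: per-question std of each metric, averaged over the questions.
avg_std = all_runs.groupby("question")[metric_cols].std().mean()

summary = pd.DataFrame({"mean_score": bar_height, "mean_std": avg_std})
```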
In the first case both the average value and the std seem fairly similar across all metrics, while in the second case this is not the case at all. I think there is something off with the results. I don't know how the code works, so I cannot give more insight, but I think this is something that should be looked at.
If you need anything else please let me know.
Ragas version: 0.1.1
Python version: 3.10.13
Code to Reproduce
I cannot share the dataset because it's not public, but I think other datasets might behave the same way.
@shahules786 Wow, this is great. Awesome work. I did some analysis earlier this year on reproducibility using KL divergence and scatter plots of experiment results from two different runs. I would love to get on a call and chat with you about this later this week if possible. Can we do that, please? my calendly
This week is not possible for me, but I have scheduled a meeting for next Monday. Please feel free to email me if you need anything else!
hey @abetatos thanks for sharing your findings with us - ensuring reproducibility is a top concern of ours so we'll try to improve this as fast as possible
I'm also running some tests here #671 to ensure that there aren't any obvious bugs. Looking forward to meeting you on Monday 🙂