Bad score for answer_correctness metric
[ ] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug The answer generated by my model is similar to the ground_truths, but my answer_correctness score is very poor, even though I have good scores in the other metrics like context_recall, context_precision, and faithfulness.
Ragas version: 0.1.18
Python version: 3.12
Code to Reproduce

scores = evaluate(
    eval_dataset,
    metrics=[context_recall, context_precision, faithfulness, answer_correctness],
    llm=llm,
    embeddings=GoogleGenerativeAIEmbeddings(
        google_api_key="<GOOGLE_API_KEY>",
        model="models/embedding-001",
    ),
)
Error trace No error, just a poor answer_correctness metric score.
Expected behavior
To get a good score for answer_correctness, like the scores I got for the other metrics.
A similar issue was discussed in a closed thread: metrics reproducibility: improve answer_correctness [1].
The poor score in the answer_correctness metric, despite having good scores in context_recall, context_precision, and faithfulness, can be attributed to the classification of statements into True Positives (TP), False Positives (FP), and False Negatives (FN). The answer_correctness metric evaluates the factuality and semantic similarity of the answer compared to the ground truth. If the answer contains statements that are not directly supported by the ground truth (FP) or misses statements that are present in the ground truth (FN), it will negatively impact the answer_correctness score.
Here is the relevant code snippet from the AnswerCorrectness class that shows how the metric is computed:
def _compute_statement_presence(
    self, prediction: AnswerCorrectnessClassification
) -> float:
    tp = len(prediction.TP)
    fp = len(prediction.FP)
    fn = len(prediction.FN)
    # F1-style score: false positives and false negatives are each half-weighted,
    # and the score is 0 when there are no true positive statements.
    score = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0
    return score
This function calculates the score based on the number of TP, FP, and FN statements. Even if the context-related metrics are high, the presence of FP and FN statements will lower the answer_correctness score [2].
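As a quick illustration of how this formula behaves, here is the same calculation run on hypothetical counts (the numbers below are made up for illustration, not taken from this issue):

# Hypothetical classification: 2 answer statements supported by the ground truth (TP),
# 1 extra statement not in the ground truth (FP), and 3 ground-truth statements
# missing from the answer (FN).
tp, fp, fn = 2, 1, 3
score = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0
print(score)  # 2 / (2 + 0.5 * 4) = 0.5

So even with perfect retrieval metrics, extra or missing statements alone can pull the factuality component down to 0.5 before it is blended with semantic similarity.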
To improve the answer_correctness score, consider the following best practices:
- Ensure Accurate Ground Truth and Answer Statements: Make sure that the ground truth and answer statements are accurate and comprehensive.
- Optimize Weights for Factuality and Semantic Similarity: Adjust the weights for factuality and semantic similarity to better reflect the importance of each aspect in your specific use case (see the sketch after this list).
- Improve Sentence Segmentation: Ensure that the sentence segmenter is accurately segmenting the text into meaningful statements.
- Enhance the Prompt and Instructions: Refine the `CORRECTNESS_PROMPT` and `CORRECTNESS_INSTRUCTIONS` to provide clearer and more detailed guidance for the classification task.
- Utilize the Answer Similarity Metric: Ensure that the `AnswerSimilarity` object is properly initialized and used to compute the similarity score.
- Handle Edge Cases and Retry Mechanism: Implement a retry mechanism to handle cases where the LLM might fail to generate a valid response [2].
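For the weights point above, here is a minimal sketch of how the factuality/semantic-similarity balance can be adjusted on the 0.1.x API. The [0.4, 0.6] split is an arbitrary illustration (the default is [0.75, 0.25]), and eval_dataset, llm, and embeddings are assumed to be the same objects as in the reproduction snippet:

from ragas import evaluate
from ragas.metrics import (
    AnswerCorrectness,
    context_precision,
    context_recall,
    faithfulness,
)

# Default weights are [0.75, 0.25]: factuality (the TP/FP/FN score above) first,
# semantic similarity second. Shifting weight toward similarity softens the
# penalty for individual FP/FN statements.
answer_correctness_weighted = AnswerCorrectness(weights=[0.4, 0.6])

scores = evaluate(
    eval_dataset,  # same dataset, llm, and embeddings as in the reproduction code
    metrics=[context_recall, context_precision, faithfulness, answer_correctness_weighted],
    llm=llm,
    embeddings=embeddings,
)

Whether re-weighting is appropriate depends on your use case; if factual agreement with the ground truth is what matters most, keeping the default weights and improving the quality of the ground-truth and answer statements is usually the better fix.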
By following these practices, you can improve the answer_correctness score in Ragas, ensuring more accurate and reliable evaluation of your RAG pipelines.
@Simeon-Krah can you show me a couple of examples in the following format
- generated_answer
- ground_truth
- answer_correctness score
@jjmachan sorry for the late response, I lost track of this. You can find my repo here:
https://github.com/Simeon-Krah/ChatEbola/blob/main/rag_langchain.ipynb
Can you try with the latest Ragas metrics collections?
from ragas.metrics.collections import AnswerCorrectness
It seems the issue was answered, closing this now.