Bad score for answer_correctness metric
[ ] I have checked the documentation and related resources and couldn't resolve my bug.
Describe the bug The answer generated by my model is similar to the ground_truths, but my answer_correctness score is very poor, even though I have good scores in the other metrics like context_recall, context_precision, and faithfulness.
Ragas version: 0.1.18
Python version: 3.12
Code to Reproduce

scores = evaluate(
    eval_dataset,
    metrics=[context_recall, context_precision, faithfulness, answer_correctness],
    llm=llm,
    embeddings=GoogleGenerativeAIEmbeddings(
        google_api_key="<GOOGLE_API_KEY>",
        model="models/embedding-001",
    ),
)
Error trace No error, just a poor answer_correctness metric score.
Expected behavior
To get a good score for answer_correctness, like the scores I got for the other metrics.
A similar issue was discussed in a closed thread: metrics reproducibility: improve answer_correctness [1].
The poor score in the answer_correctness metric, despite having good scores in context_recall, context_precision, and faithfulness, can be attributed to the classification of statements into True Positives (TP), False Positives (FP), and False Negatives (FN). The answer_correctness metric evaluates the factuality and semantic similarity of the answer compared to the ground truth. If the answer contains statements that are not directly supported by the ground truth (FP) or misses statements that are present in the ground truth (FN), it will negatively impact the answer_correctness score.
Here is the relevant code snippet from the AnswerCorrectness class that shows how the metric is computed:
def _compute_statement_presence(
    self, prediction: AnswerCorrectnessClassification
) -> float:
    tp = len(prediction.TP)
    fp = len(prediction.FP)
    fn = len(prediction.FN)
    # F1-style score: false positives and false negatives are each half-weighted,
    # and the score is 0 when there are no true positive statements.
    score = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0
    return score
This function calculates the score based on the number of TP, FP, and FN statements. Even if the context-related metrics are high, the presence of FP and FN statements will lower the answer_correctness score [2].
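As a quick illustration of how this formula behaves, here is the same calculation run on hypothetical counts (the numbers below are made up for illustration, not taken from this issue):

# Hypothetical classification: 2 answer statements supported by the ground truth (TP),
# 1 extra statement not in the ground truth (FP), and 3 ground-truth statements
# missing from the answer (FN).
tp, fp, fn = 2, 1, 3
score = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0
print(score)  # 2 / (2 + 0.5 * 4) = 0.5

So even with perfect retrieval metrics, extra or missing statements alone can pull the factuality component down to 0.5 before it is blended with semantic similarity.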
To improve the answer_correctness score, consider the following best practices:
- Ensure Accurate Ground Truth and Answer Statements: Make sure that the ground truth and answer statements are accurate and comprehensive.
- Optimize Weights for Factuality and Semantic Similarity: Adjust the weights for factuality and semantic similarity to better reflect the importance of each aspect in your specific use case (see the sketch after this list).
- Improve Sentence Segmentation: Ensure that the sentence segmenter is accurately segmenting the text into meaningful statements.
- Enhance the Prompt and Instructions: Refine the `CORRECTNESS_PROMPT` and `CORRECTNESS_INSTRUCTIONS` to provide clearer and more detailed guidance for the classification task.
- Utilize the Answer Similarity Metric: Ensure that the `AnswerSimilarity` object is properly initialized and used to compute the similarity score.
- Handle Edge Cases and Retry Mechanism: Implement a retry mechanism to handle cases where the LLM might fail to generate a valid response [2].
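For the weights point above, here is a minimal sketch of how the factuality/semantic-similarity balance can be adjusted on the 0.1.x API. The [0.4, 0.6] split is an arbitrary illustration (the default is [0.75, 0.25]), and eval_dataset, llm, and embeddings are assumed to be the same objects as in the reproduction snippet:

from ragas import evaluate
from ragas.metrics import (
    AnswerCorrectness,
    context_precision,
    context_recall,
    faithfulness,
)

# Default weights are [0.75, 0.25]: factuality (the TP/FP/FN score above) first,
# semantic similarity second. Shifting weight toward similarity softens the
# penalty for individual FP/FN statements.
answer_correctness_weighted = AnswerCorrectness(weights=[0.4, 0.6])

scores = evaluate(
    eval_dataset,  # same dataset, llm, and embeddings as in the reproduction code
    metrics=[context_recall, context_precision, faithfulness, answer_correctness_weighted],
    llm=llm,
    embeddings=embeddings,
)

Whether re-weighting is appropriate depends on your use case; if factual agreement with the ground truth is what matters most, keeping the default weights and improving the quality of the ground-truth and answer statements is usually the better fix.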
By following these practices, you can improve the answer_correctness score in Ragas, ensuring more accurate and reliable evaluation of your RAG pipelines.
@Simeon-Krah can you show me a couple of examples in the following format
- generated_answer
- ground_truth
- answer_correctness score
@jjmachan sorry for the late response, I lost track of this. You can find my repo here:
https://github.com/Simeon-Krah/ChatEbola/blob/main/rag_langchain.ipynb
Can you try with the latest Ragas metrics collections?
from ragas.metrics.collections import AnswerCorrectness
It seems the issue was answered, closing this now.