
Context Precision score heavily influenced by position of relevant contexts in array

Open mshobana1990 opened this issue 8 months ago • 5 comments

[ ] I have checked the documentation and related resources and couldn't resolve my bug.

Describe the bug The ContextPrecision evaluator in RAGAS produces scores that are strongly influenced by the position of relevant contexts in the input array. When a relevant context appears as the first element of the retrieved_contexts array, the score is artificially inflated (often >0.9) even when every other context is irrelevant, and the score gradually decreases as the relevant context moves later in the array.

Ragas version: 0.2.14
Python version: 3.11

Code to Reproduce

The test below fails with a score of 0.9999999999:

from ragas import SingleTurnSample
from ragas.metrics import ContextPrecision

# evaluator_llm is assumed to be a ragas-wrapped evaluator LLM configured elsewhere.

def test_context_precision():
    sample = SingleTurnSample(
        reference="The capital of France is Paris.",
        retrieved_contexts=[
            "The capital of France is Paris",
            "Bahmni is a comprehensive, easy-to-use, and fully open-source Hospital Information System (HIS)",
            "it suitable for a wide range of healthcare facilities, from small clinics to large hospitals.",
            "A resilient EMR and hospital management system built on reliable open-source components.",
        ],
        user_input="What is the capital of France?",
    )

    context_precision = ContextPrecision(llm=evaluator_llm)
    score = context_precision.single_turn_score(sample)
    print("Context Precision score: ", score)
    assert score < 0.3

WHEREAS

The test below passes. The only change is that the relevant context has been moved to the last position.

def test_context_precision():
    sample = SingleTurnSample(
        reference="The capital of France is Paris.",
        retrieved_contexts=[
            "Bahmni is a comprehensive, easy-to-use, and fully open-source Hospital Information System (HIS)",
            "it suitable for a wide range of healthcare facilities, from small clinics to large hospitals.",
            "A resilient EMR and hospital management system built on reliable open-source components.",
            "The capital of France is Paris",
        ],
        user_input="What is the capital of France?",
    )

    context_precision = ContextPrecision(llm=evaluator_llm)
    score = context_precision.single_turn_score(sample)
    print("Context Precision score: ", score)
    assert score < 0.3

Also, when the relevant context is in the second position, the score is 0.49999999995; when it is in the third position, the score is 0.3333333333.
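For what it's worth, the observed pattern (1.0, 0.5, 0.333…, and <0.3 for the last position) is consistent with a rank-aware mean of precision@k over the relevant positions, which is how Context Precision is described in the ragas docs. The sketch below (a plain-Python approximation, not the library's actual code) reproduces the reported numbers from binary relevance verdicts:

```python
def context_precision_at_k(relevance):
    """Mean of precision@k over the positions k that hold a relevant context.

    `relevance` is a list of 0/1 verdicts, one per retrieved context, in
    retrieval order; this approximates sum(precision@k * v_k) / (# relevant).
    """
    hits = 0
    precisions = []
    for k, v in enumerate(relevance, start=1):
        hits += v
        if v:
            precisions.append(hits / k)
    return sum(precisions) / max(sum(relevance), 1)

# A single relevant context among four, at each possible position:
print(context_precision_at_k([1, 0, 0, 0]))  # 1.0
print(context_precision_at_k([0, 1, 0, 0]))  # 0.5
print(context_precision_at_k([0, 0, 1, 0]))  # 0.3333...
print(context_precision_at_k([0, 0, 0, 1]))  # 0.25
```

If this is indeed the formula in play, the position sensitivity is by design (it rewards retrievers that rank relevant chunks first), and a purely order-insensitive "fraction of relevant contexts" would be a different metric.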

Error trace Not applicable.

Expected behavior The score should not be affected by the position of relevant context.

Additional context The issue is consistent across OpenAI GPT-4o, AWS Bedrock Claude 3.7 Sonnet, and Azure OpenAI GPT-4o. This position bias severely impacts the reliability of the Context Precision metric for evaluating RAG systems. In real-world applications, the ordering of retrieved contexts should not affect the precision score, as the goal is to measure how many of the retrieved contexts are actually relevant to the question.

mshobana1990 avatar Apr 15 '25 13:04 mshobana1990