
Inconsistency between Aggregate Results and Indices Results

Open rodrigues-pedro opened this issue 1 year ago • 1 comments

Okay, kind of a long story, so please bear with me.

It all started when I wanted to merge the Partial and Type evaluation methods: I wanted the Partial evaluation, but only when the Type was also correct.

So my approach was to get the result_indices from both lists with the following code:

partial_cor = result_indices['partial']['correct_indices']
partial_inc = result_indices['partial']['incorrect_indices']
partial_par = result_indices['partial']['partial_indices']

ent_type_cor = result_indices['ent_type']['correct_indices']
ent_type_inc = result_indices['ent_type']['incorrect_indices']

# All with Incorrect Type Stay Incorrect
mix_inc = ent_type_inc

# The ones with Correct Type are divided between Correct and Partial according to the Partial Evaluation
mix_cor = [x for x in ent_type_cor if x in partial_cor]
mix_par = [x for x in ent_type_cor if x in partial_par]

I would then calculate the aggregated result from the lengths of these arrays.
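As a minimal runnable sketch (using hypothetical index values in the (instance, entity) format that nervaluate returns), the merge and the length-based aggregation look like:

```python
# Hypothetical index lists in nervaluate's (instance, entity) format,
# chosen to mirror the four-example test dataset below.
ent_type_cor = [(0, 0), (2, 0)]
ent_type_inc = [(1, 0), (3, 0)]
partial_cor = [(0, 0), (1, 0)]
partial_par = [(2, 0), (3, 0)]

# Entities with the wrong type stay incorrect.
mix_inc = ent_type_inc

# Entities with the correct type split into correct/partial by span match.
mix_cor = [x for x in ent_type_cor if x in partial_cor]
mix_par = [x for x in ent_type_cor if x in partial_par]

# Aggregate the merged evaluation by the lengths of the arrays.
mix = {
    "correct": len(mix_cor),
    "incorrect": len(mix_inc),
    "partial": len(mix_par),
}
print(mix)  # {'correct': 1, 'incorrect': 2, 'partial': 1}
```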

I created a fake dataset to validate my algorithm:

test_dataset = [
    # Correct Entity and Correct Type
    {
        "true": [{"label": "COR", "start": 0, "end": 10}],
        "pred": [{"label": "COR", "start": 0, "end": 10}]
    },
    # Correct Entity and Incorrect Type
    {
        "true": [{"label": "COR", "start": 0, "end": 10}],
        "pred": [{"label": "INC", "start": 0, "end": 10}]
    },
    # Partial Entity and Correct Type
    {
        "true": [{"label": "COR", "start": 0, "end": 10}],
        "pred": [{"label": "COR", "start": 1, "end": 9}]
    },
    # Partial Entity and Incorrect Type
    {
        "true": [{"label": "COR", "start": 0, "end": 10}],
        "pred": [{"label": "INC", "start": 1, "end": 9}]
    }
]

true = [x['true'] for x in test_dataset]
pred = [x['pred'] for x in test_dataset]

I was expecting the following results:

"partial": {
    "correct": 2,
    "incorrect": 0,
    "partial": 2
}
"ent_type": {
    "correct": 2,
    "incorrect": 2,
    "partial": 0
}
"mix": {
    "correct": 1,
    "incorrect": 2,
    "partial": 1
}

When I ran the code, I got what I expected for the aggregated results, but not for the result indices:

evaluator = Evaluator(true, pred, tags=['COR', 'INC'], loader="default")
results, results_per_tag, result_indices, result_indices_by_tag = evaluator.evaluate()

print("Type")
print(json.dumps(results['ent_type'], indent=4, ensure_ascii=False))

print("Partial")
print(json.dumps(results['partial'], indent=4, ensure_ascii=False))

>>>>
Type
{
    "correct": 2,
    "incorrect": 2,
    "partial": 0,
    "missed": 0,
    "spurious": 0,
    "possible": 4,
    "actual": 4,
    "precision": 0.5,
    "recall": 0.5,
    "f1": 0.5
}
Partial
{
    "correct": 2,
    "incorrect": 0,
    "partial": 2,
    "missed": 0,
    "spurious": 0,
    "possible": 4,
    "actual": 4,
    "precision": 0.75,
    "recall": 0.75,
    "f1": 0.75
}

But, for some reason, the indices did not match:

print("Type")
print(json.dumps(result_indices['ent_type'], indent=4, ensure_ascii=False))

print("Partial")
print(json.dumps(result_indices['partial'], indent=4, ensure_ascii=False))

Type
{
    "correct_indices": [
        [0, 0]
    ],
    "incorrect_indices": [
        [1, 0],
        [3, 0]
    ],
    "partial_indices": [],
    "missed_indices": [],
    "spurious_indices": []
}
Partial
{
    "correct_indices": [
        [0, 0]
    ],
    "incorrect_indices": [],
    "partial_indices": [
        [2, 0],
        [3, 0]
    ],
    "missed_indices": [],
    "spurious_indices": []
}

It really bugged me, because Type should have added the index [2, 0] to its correct_indices, and Partial should have added [1, 0] to its correct_indices. If these entities are counted in the aggregated view, they should appear in the indices view, right?

I apologize if I am missing something obvious. I would appreciate any help understanding this outcome.

I would also appreciate any help finding another way to achieve my desired result using this library.

rodrigues-pedro avatar Oct 26 '24 17:10 rodrigues-pedro

hey @rodrigues-pedro, thanks so much for taking the time to flag this. I added the indices results feature. I just saw this issue by chance - I’m tight for time at the moment but I will look into this as soon as I can.

jackboyla avatar Nov 19 '24 09:11 jackboyla

hey @rodrigues-pedro, sorry for such a delay, but better late than never 😅

I tried to replicate your problem, but it seems to have been solved already (nice job @davidsbatista!):

def test_evaluation_type_merge():
    test_dataset = [
        # Correct Entity and Correct Type
        {
            "true": [{"label": "COR", "start": 0, "end": 10}],
            "pred": [{"label": "COR", "start": 0, "end": 10}]
        },
        # Correct Entity and Incorrect Type
        {
            "true": [{"label": "COR", "start": 0, "end": 10}],
            "pred": [{"label": "INC", "start": 0, "end": 10}]
        },
        # Partial Entity and Correct Type
        {
            "true": [{"label": "COR", "start": 0, "end": 10}],
            "pred": [{"label": "COR", "start": 1, "end": 9}]
        },
        # Partial Entity and Incorrect Type
        {
            "true": [{"label": "COR", "start": 0, "end": 10}],
            "pred": [{"label": "INC", "start": 1, "end": 9}]
        }
    ]

    true = [x['true'] for x in test_dataset]
    pred = [x['pred'] for x in test_dataset]


    evaluator = Evaluator(true, pred, tags=['COR', 'INC'], loader="default")
    results = evaluator.evaluate()

    # Aggregated results

    results['overall']['ent_type']
    # EvaluationResult(correct=2, incorrect=2, partial=0, missed=0, spurious=0, precision=0.5, recall=0.5, f1=0.5, actual=4, possible=4)

    results['overall']['partial']
    # EvaluationResult(correct=2, incorrect=0, partial=2, missed=0, spurious=0, precision=0.5, recall=0.5, f1=0.5, actual=4, possible=4)

    # Indices
    
    results['overall_indices']['ent_type']
    # EvaluationIndices(correct_indices=[(0, 0), (2, 0)], incorrect_indices=[(1, 0), (3, 0)], partial_indices=[], missed_indices=[], spurious_indices=[])

    results['overall_indices']['partial']
    # EvaluationIndices(correct_indices=[(0, 0), (1, 0)], incorrect_indices=[], partial_indices=[(2, 0), (3, 0)], missed_indices=[], spurious_indices=[])
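So the counts and the index lists now agree with each other, which was the crux of the original report. As a quick sanity-check sketch (with the values copied from the output above; nothing here is computed by the library):

```python
# Each aggregated count should equal the length of the corresponding
# index list for the same category.
ent_type_counts = {"correct": 2, "incorrect": 2, "partial": 0}
ent_type_indices = {
    "correct": [(0, 0), (2, 0)],
    "incorrect": [(1, 0), (3, 0)],
    "partial": [],
}

for category, count in ent_type_counts.items():
    assert count == len(ent_type_indices[category]), category

print("counts match indices")  # prints "counts match indices"
```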

jackboyla avatar Jun 08 '25 13:06 jackboyla

Seems this is solved in the new release; I will close this issue.

davidsbatista avatar Sep 10 '25 12:09 davidsbatista