Some bug in evaluation metrics with null positives
(Tagging @helenkesete)
We were looking at the evaluation metrics implemented here, and ran into what I think is a pretty subtle bug in the MRR calculation.
https://github.com/furkanyesiler/re-move/blob/5ddd5dfa136252c443e00c4c45c08a93cc315583/utils/metrics.py#L27-L32
The above code identifies the position of the first positive result if one exists. However, if no positive result exists, the `topk` call will just return the first position, because `found = [0, 0, 0, 0, ...]` and so `found - temp = [0, -1e-6, -2e-6, -3e-6, ...]`. This results in an inflated metric for queries with null `ytrue` sets.
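A minimal, standalone repro of the edge case (assuming a hypothetical `k = 4` and a single query with no positives):

```python
import torch

k = 4
found = torch.zeros(1, k)                    # no relevant items retrieved
temp = torch.arange(k).float() * 1e-6        # tie-breaking offsets, as in metrics.py
_, sel = torch.topk(found - temp, 1, dim=1)  # picks the least negative entry: index 0

# The metric treats the (nonexistent) first positive as sitting at rank 1,
# so this query contributes a perfect 1.0 to MRR.
mrr_contribution = 1.0 / (sel.item() + 1)
```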
Looking a bit further down in the evaluation, I noticed that this case is handled correctly(*) in meanAP:
https://github.com/furkanyesiler/re-move/blob/5ddd5dfa136252c443e00c4c45c08a93cc315583/utils/metrics.py#L40-L41
where the mean over AP scores is restricted to those queries where `sum(ytrue) > 0`.
There are two caveats to this:
- meanAP is potentially averaged over a different query set than the other metrics, which seems not ideal.
- when `return_mean=False` is passed in, the returned vector of per-query AP scores is already conditioned on having positive results, but the evaluator has lost track of the corresponding indices. This makes it difficult to link the scores back to the input data later on.
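To illustrate the second caveat: a caller who still has `ytrue` around can reconstruct which queries the filtered AP vector refers to, but the metric function itself doesn't return this mapping. A hypothetical sketch:

```python
import torch

# Query 1 has no positives, so it gets dropped from the per-query AP vector.
ytrue = torch.tensor([[1., 0.], [0., 0.], [0., 1.]])
ap_filtered = torch.tensor([0.5, 1.0])  # per-query APs after filtering

# Recover the original indices: ap_filtered[i] belongs to query kept_indices[i].
kept_indices = torch.nonzero(torch.sum(ytrue, 1) > 0).squeeze(1)
```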
So I have a couple of proposed modifications:
```python
# Identify queries with positive results
has_positives = torch.sum(ytrue, 1) > 0

_, spred = torch.topk(ypred, k, dim=1)
found = torch.gather(ytrue, 1, spred)

temp = torch.arange(k).float() * 1e-6
_, sel = torch.topk(found - temp, 1, dim=1)

# Knock out queries with no positives
sel = sel.float()
sel[~has_positives] = torch.nan

mrr = torch.nanmean(1 / (sel + 1))
mr = torch.nanmean(sel + 1)
top1 = torch.sum(found[:, 0])
top10 = torch.sum(found[:, :10])

pos = torch.arange(1, spred.size(1) + 1).unsqueeze(0).to(ypred.device)
prec = torch.cumsum(found, 1) / pos.float()
mask = (found > 0).float()
ap = torch.sum(prec * mask, 1) / (torch.sum(ytrue, 1) + eps)
ap[~has_positives] = torch.nan

if print_metrics:
    print('mAP: {:.3f}'.format(ap.nanmean().item()))
    print('MRR: {:.3f}'.format(mrr.item()))
    print('MR: {:.3f}'.format(mr.item()))
    print('Top1: {:.0f}'.format(top1.item()))
    print('Top10: {:.0f}'.format(top10.item()))

return ap.nanmean() if reduce_mean else ap
```
Quick summary:
- replace `mean` with `nanmean`
- populate NaNs in the MR and MRR results for queries with null result sets
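The effect of the two changes together, as a small sketch: NaN-marked queries simply drop out of the average, so MRR is computed only over queries that actually have positives.

```python
import torch

# Per-query reciprocal ranks; query 1 had no positives and was marked NaN.
per_query_rr = torch.tensor([1.0, float('nan'), 0.5])

# nanmean ignores the NaN entry: mean of [1.0, 0.5]
mrr = torch.nanmean(per_query_rr)
```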
As an alternative / suggestion: you might consider accepting a `fill_nan=` parameter here, which could replace NaNs with zeros. The logic being that 0 could be a reasonable limiting value for MRR and meanAP (though not for MR) on null-result queries, and it could be reasonable to include them in some situations.
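A minimal sketch of that option (the `fill_nan` name and helper are hypothetical, not part of the repo):

```python
import torch

def maybe_fill(scores, fill_nan=None):
    # Hypothetical post-processing step: replace NaNs from null-result
    # queries with a caller-chosen limiting value (e.g. 0 for MRR/meanAP).
    if fill_nan is not None:
        return torch.nan_to_num(scores, nan=fill_nan)
    return scores

ap = torch.tensor([0.5, float('nan'), 1.0])
filled = maybe_fill(ap, fill_nan=0.0)
```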