Some bug in evaluation metrics with null positives
(Tagging @helenkesete)
We were looking at the evaluation metrics implemented here, and ran into what I think is a pretty subtle bug in the MRR calculation.
https://github.com/furkanyesiler/re-move/blob/5ddd5dfa136252c443e00c4c45c08a93cc315583/utils/metrics.py#L27-L32
The above code identifies the position of the first positive result if one exists. However, if no positive result exists, the `topk` call will just return the first position, because `found = [0, 0, 0, 0, ...]` and so `found - temp = [0, -1e-6, -2e-6, -3e-6, ...]`. This results in an inflated metric for queries with null `ytrue` sets.
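A minimal, standalone repro of the edge case (assuming a hypothetical `k = 4` and a single query with no positives):

```python
import torch

k = 4
found = torch.zeros(1, k)                    # no relevant items retrieved
temp = torch.arange(k).float() * 1e-6        # tie-breaking offsets, as in metrics.py
_, sel = torch.topk(found - temp, 1, dim=1)  # picks the least negative entry: index 0

# The metric treats the (nonexistent) first positive as sitting at rank 1,
# so this query contributes a perfect 1.0 to MRR.
mrr_contribution = 1.0 / (sel.item() + 1)
```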
Looking a bit further down in the evaluation, I noticed that this case is handled correctly(*) in meanAP:
https://github.com/furkanyesiler/re-move/blob/5ddd5dfa136252c443e00c4c45c08a93cc315583/utils/metrics.py#L40-L41
where the mean over AP scores is restricted to those queries where `sum(ytrue) > 0`.
There are two caveats to this:
- meanAP is potentially averaged over a different query set than the other metrics, which seems not ideal.
- when `return_mean=False` is passed in, the returned vector of per-query AP scores is already conditioned on having positive results, but the evaluator has lost track of the corresponding indices. This makes it difficult to link the scores back to the input data later on.
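To illustrate the second caveat: a caller who still has `ytrue` around can reconstruct which queries the filtered AP vector refers to, but the metric function itself doesn't return this mapping. A hypothetical sketch:

```python
import torch

# Query 1 has no positives, so it gets dropped from the per-query AP vector.
ytrue = torch.tensor([[1., 0.], [0., 0.], [0., 1.]])
ap_filtered = torch.tensor([0.5, 1.0])  # per-query APs after filtering

# Recover the original indices: ap_filtered[i] belongs to query kept_indices[i].
kept_indices = torch.nonzero(torch.sum(ytrue, 1) > 0).squeeze(1)
```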
So I have a couple of proposed modifications:
```python
# Identify queries with positive results
has_positives = torch.sum(ytrue, 1) > 0

_, spred = torch.topk(ypred, k, dim=1)
found = torch.gather(ytrue, 1, spred)

temp = torch.arange(k).float() * 1e-6
_, sel = torch.topk(found - temp, 1, dim=1)

# Knock out queries with no positives
sel = sel.float()
sel[~has_positives] = torch.nan

mrr = torch.nanmean(1 / (sel + 1))
mr = torch.nanmean(sel + 1)
top1 = torch.sum(found[:, 0])
top10 = torch.sum(found[:, :10])

pos = torch.arange(1, spred.size(1) + 1).unsqueeze(0).to(ypred.device)
prec = torch.cumsum(found, 1) / pos.float()
mask = (found > 0).float()
ap = torch.sum(prec * mask, 1) / (torch.sum(ytrue, 1) + eps)
ap[~has_positives] = torch.nan

if print_metrics:
    print('mAP: {:.3f}'.format(ap.nanmean().item()))
    print('MRR: {:.3f}'.format(mrr.item()))
    print('MR: {:.3f}'.format(mr.item()))
    print('Top1: {:.0f}'.format(top1.item()))
    print('Top10: {:.0f}'.format(top10.item()))

return ap.nanmean() if reduce_mean else ap
```
Quick summary:
- replace `mean` with `nanmean`
- populate NaNs in the MR and MRR results for queries with null result sets
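The effect of the two changes together, as a small sketch: NaN-marked queries simply drop out of the average, so MRR is computed only over queries that actually have positives.

```python
import torch

# Per-query reciprocal ranks; query 1 had no positives and was marked NaN.
per_query_rr = torch.tensor([1.0, float('nan'), 0.5])

# nanmean ignores the NaN entry: mean of [1.0, 0.5]
mrr = torch.nanmean(per_query_rr)
```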
As an alternative / suggestion: you might consider accepting a `fill_nan=` parameter here, which could replace NaNs with zeros. The logic being that 0 could be a reasonable limiting value for MRR and meanAP (though not for MR) on null-result queries, and it could be reasonable to include them in some situations.
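A minimal sketch of that option (the `fill_nan` name and helper are hypothetical, not part of the repo):

```python
import torch

def maybe_fill(scores, fill_nan=None):
    # Hypothetical post-processing step: replace NaNs from null-result
    # queries with a caller-chosen limiting value (e.g. 0 for MRR/meanAP).
    if fill_nan is not None:
        return torch.nan_to_num(scores, nan=fill_nan)
    return scores

ap = torch.tensor([0.5, float('nan'), 1.0])
filled = maybe_fill(ap, fill_nan=0.0)
```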