Is the 'mrr_score' implementation correct?
Hi,
I was recently using the `mrr_score` implementation (link):
```python
def mrr_score(y_true, y_score):
    """Computing mrr score metric.

    Args:
        y_true (np.ndarray): Ground-truth labels.
        y_score (np.ndarray): Predicted labels.

    Returns:
        numpy.ndarray: mrr scores.
    """
    order = np.argsort(y_score)[::-1]
    y_true = np.take(y_true, order)
    rr_score = y_true / (np.arange(len(y_true)) + 1)
    return np.sum(rr_score) / np.sum(y_true)
```
I am not sure if I've misunderstood the current implementation, but as far as I can see, it does not account for situations where there are multiple positive examples in one sample:
```python
>>> mrr_score([1, 0, 0], [1, 0, 0])
1.0
>>> mrr_score([1, 1, 0], [1, 1, 0])
0.75
```
Furthermore, according to the docstring, the input `y_score` should be predicted labels; however, for MRR we are really interested in the rank of the first positive item in a given sample (MRR-wiki).
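For reference, the definition from the linked Wikipedia article, over a set of samples $Q$, is

$$\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i}$$

where $\mathrm{rank}_i$ is the position of the first relevant item in sample $i$.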
My suggestion is:

```python
import numpy as np

def reciprocal_rank_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Assumes at least one positive in y_true; otherwise np.argmax
    # returns 0 and the score is incorrectly reported as 1.0.
    order = np.argsort(y_pred)[::-1]             # indices sorted by score, descending
    y_true = np.take(y_true, order)              # labels in ranked order
    first_positive_rank = np.argmax(y_true) + 1  # 1-based rank of the first positive
    return 1.0 / first_positive_rank
```
```python
>>> y_true_1 = np.array([0, 0, 1])
>>> y_pred_1 = np.array([0.5, 0.2, 0.1])
>>> reciprocal_rank_score(y_true_1, y_pred_1)
0.3333333333333333
>>> y_true_2 = np.array([0, 1, 1])
>>> y_pred_2 = np.array([0.5, 0.2, 0.1])
>>> reciprocal_rank_score(y_true_2, y_pred_2)
0.5
>>> y_true_3 = np.array([1, 1, 0])
>>> y_pred_3 = np.array([0.5, 0.2, 0.1])
>>> reciprocal_rank_score(y_true_3, y_pred_3)
1.0
>>> np.mean([reciprocal_rank_score(y_true, y_pred) for y_true, y_pred in zip([y_true_1, y_true_2, y_true_3], [y_pred_1, y_pred_2, y_pred_3])])
0.6111111111111112
```
The original implementation does give the expected value if it is handed the rankings rather than the predicted labels, with all items assumed positive, as in my example:

```python
>>> mrr_score([1, 1, 1], [3, 2, 1])
0.6111111111111112
```

Here the sorted order is the identity, `rr_score` becomes `[1/1, 1/2, 1/3]`, and dividing by `np.sum(y_true) = 3` yields exactly the mean of the three reciprocal ranks computed above. But then `y_true` is not a needed input at all.
If I haven't misunderstood and you agree, I would be happy to open a PR with the suggested improvements.
I followed the example used in the Medium post: MRR vs MAP vs NDCG: Rank-Aware Evaluation Metrics And When To Use Them (behind paywall).
Thanks for the awesome repo!