[Question] Difference between top_k_categorical_accuracy_at_100 and recall at 100
Hi Team,
Thanks for the awesome discussions. I've been learning a lot from the issues here, and many of them have improved the accuracy of the work I'm doing. I'm wondering whether there's a big difference in how these two metrics are calculated: I get around 0.9 for top_k_categorical_accuracy_at_100, but when I do offline evaluation using recall at 100 I only get around 0.35.
From my understanding, top_k_categorical_accuracy_at_100 is the fraction of interactions for which the ground-truth candidate is ranked in the top k among the in-batch candidates.
If my batch size is 1028, can I assume that a top_k_categorical_accuracy_at_100 of 0.9 means that, out of 1028 random candidates, we get the right one in the top 100 in 90% of cases? Am I missing something?
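For concreteness, my mental model of that metric is something like the sketch below; the embedding shapes and the diagonal (in-batch) labelling are assumptions on my side, not the library's actual internals:

```python
import tensorflow as tf

batch_size, dim = 1028, 32
query_embeddings = tf.random.normal((batch_size, dim))
candidate_embeddings = tf.random.normal((batch_size, dim))

# scores[i, j] = affinity between query i and in-batch candidate j.
scores = tf.matmul(query_embeddings, candidate_embeddings, transpose_b=True)

# The positive for query i is candidate i: the usual in-batch labelling.
labels = tf.eye(batch_size)

# 1.0 for each query whose true candidate lands in its top 100 scores.
hits = tf.keras.metrics.top_k_categorical_accuracy(labels, scores, k=100)
print(float(tf.reduce_mean(hits)))  # in-batch top-100 "accuracy"
```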
That is very close to the definition of the recall I'm calculating: recall at 100 = |user future interactions ∩ top 100 candidates for the user embedding| / |user future interactions|.
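Written out as a small helper (the identifiers here are made up for illustration):

```python
def recall_at_k(future_interactions, top_k_candidates):
    """|future interactions ∩ top-k candidates| / |future interactions|."""
    future = set(future_interactions)
    if not future:
        return None  # undefined for users with no future interactions
    return len(future & set(top_k_candidates)) / len(future)

# A user with 4 future restaurant orders, 1 of which was retrieved:
print(recall_at_k(["r1", "r2", "r3", "r4"], ["r1", "r9", "r8"]))  # 0.25
```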
The key differences are as follows. For top_k_categorical_accuracy_at_100:
- candidates are selected randomly, and I have a total of ~3000 candidates, so many duplicates will appear in a batch of 1028 and the results will be optimistic, since the number of unique candidates is much smaller (see the quick calculation after this list);
- not all candidates (restaurants) can deliver to all queries (users), so the right candidates are a subset of all candidates; if the model has learnt that, it can easily rule out the candidates that can't interact with the query, which makes the classifier's job easier.
For the recall at 100 calculation, on the other hand:
- I only consider the candidates (restaurants) that can deliver to the user, so it's a harder problem;
- there are no duplicates, as I only deal with unique candidates.
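To put a rough number on the duplicates point above, here is the expected number of unique candidates when sampling uniformly with replacement; real batches are drawn from the interaction distribution, so popular candidates repeat even more and the unique count is lower still:

```python
# Expected unique candidates when drawing a batch of 1028 uniformly at
# random, with replacement, from 3000 candidates.
n_candidates, batch_size = 3000, 1028
expected_unique = n_candidates * (1 - (1 - 1 / n_candidates) ** batch_size)
print(round(expected_unique))  # ~870 unique candidates, far fewer than 1028
```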
But even so, I wouldn't expect the results to be that different: 0.9 for top_k_categorical_accuracy_at_100 vs 0.35 for recall at 100.
Hi @OmarMAmin,
Not knowing the frequency distribution of your candidates, combined with the fact that you have delivery constraints, makes it hard for me to have any intuition on this.
I would perhaps first compare a single metric (top-k accuracy) in-batch vs. over unique candidates. If these differ significantly, I'd look at some examples and work from there.
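A minimal sketch of that check, assuming you can score queries against the full unique candidate set by brute force (the shapes and names here are placeholders, not your actual pipeline):

```python
import tensorflow as tf

n_unique, n_queries, dim = 3000, 500, 32
unique_candidate_embeddings = tf.random.normal((n_unique, dim))
query_embeddings = tf.random.normal((n_queries, dim))
true_candidate_ids = tf.random.uniform((n_queries,), maxval=n_unique, dtype=tf.int32)

# Score every query against every *unique* candidate: no duplicates.
scores = tf.matmul(query_embeddings, unique_candidate_embeddings, transpose_b=True)
labels = tf.one_hot(true_candidate_ids, depth=n_unique)

hit_rate = tf.reduce_mean(
    tf.keras.metrics.top_k_categorical_accuracy(labels, scores, k=100))
print(float(hit_rate))  # compare this against the in-batch 0.9
```

If I recall correctly, this is roughly what tfrs.metrics.FactorizedTopK computes when it is given the full candidate dataset rather than in-batch candidates.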