Question regarding how preferences are modeled with BPR loss
Hi all!
First of all, thank you so much for the effort in this lib. Works really well :)
However, while using it with some implicit data (count data, in fact), I had a question about how training is done with the BPR loss. Let me explain:
As you already know, BPR is a pairwise loss: each training instance is a triplet of the form (user, item1, item2), where the model learns that the user prefers item1 over item2.
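Just to fix notation, here is a minimal sketch of the pairwise objective as I understand it (the function name is mine, not LightFM's):

```python
import numpy as np

def bpr_triplet_loss(score_pos, score_neg):
    """BPR loss for a single (user, item1, item2) triplet:
    -log(sigmoid(score_pos - score_neg)).
    Minimizing it pushes the preferred item's score above the
    other item's score for that user."""
    # logaddexp(0, -x) == -log(sigmoid(x)), numerically stable
    return np.logaddexp(0.0, -(score_pos - score_neg))
```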
Imagine I have an implicit dataset with play counts for movies (how many times each user has watched each movie). To keep things simple, let's say I have only a single user, with the following interaction data:
- Movie A, watched 8 times
- Movie B, watched 2 times
- Movie C, watched 2 times
- Movie D, watched 0 times (that is, no interaction)
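For concreteness, here is that toy dataset expressed as the kind of sparse matrices LightFM consumes (a sketch; the item indices are arbitrary):

```python
import numpy as np
from scipy.sparse import coo_matrix

# One user, four movies (A, B, C, D -> columns 0..3).
# Movie D has no interaction, so it is simply absent.
rows = np.array([0, 0, 0])    # user index
cols = np.array([0, 1, 2])    # Movies A, B, C
counts = np.array([8, 2, 2])  # play counts

# With BPR/WARP, any non-zero entry is a positive interaction; the
# raw counts themselves are not part of the loss.
interactions = coo_matrix((np.ones_like(counts), (rows, cols)), shape=(1, 4))
play_counts = coo_matrix((counts, (rows, cols)), shape=(1, 4))
```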
I was wondering what the training triplets would look like with such a dataset. We know the preference for every possible item pair except B vs. C (their play counts are identical, so no preference between them can be inferred).
However, skimming through the source code, it looks like the way negative sampling works in LightFM with the BPR loss would yield only the following training triplets:
- (user, Movie A, Movie D)
- (user, Movie B, Movie D)
- (user, Movie C, Movie D)
Which is correct (BPR assumes all watched movies are preferred over the unwatched one). However, the model would be missing some valuable information: our user also prefers Movie A over B and C (having watched it far more times).
From what I have seen in the source code, non-zero items are only ever compared against zero (unwatched) ones, so the "A over B" and "A over C" training instances are never generated.
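To illustrate, this is the sampling scheme I think is happening, in simplified pure-Python form (my mental model, not the actual Cython implementation, which may handle accidentally sampled positives differently):

```python
import random

def sample_bpr_triplets(user_positives, num_items):
    """Every positive (user, item) interaction is paired with a
    uniformly sampled item used as the negative. Two watched
    movies are never compared against each other."""
    for user, positives in user_positives.items():
        for pos in positives:
            neg = random.randrange(num_items)  # uniform over all items
            yield (user, pos, neg)

# Toy example: A(0), B(1), C(2) are positives; D(3) only ever shows
# up as a randomly sampled negative.
print(list(sample_bpr_triplets({0: [0, 1, 2]}, num_items=4)))
```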
Is that right? If that is the case, would you be interested in a contribution to include those additional preferences as part of the training procedure?
Also, does it work the same way for WARP loss?
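(My understanding is that WARP differs only in how the negative is chosen: items are sampled repeatedly until one scores within a margin of the positive, and the update is weighted by a rank estimate. A rough sketch of that scheme from Weston et al., again not LightFM's actual code:)

```python
import math
import random

def warp_negative(score, user, pos, num_items, max_tries=100):
    """Keep sampling until a 'violating' negative is found, i.e.
    one scoring within a margin of 1 of the positive. The number
    of tries yields a rank estimate that scales the update."""
    pos_score = score(user, pos)
    for tries in range(1, max_tries + 1):
        neg = random.randrange(num_items)
        if score(user, neg) > pos_score - 1.0:
            rank_estimate = max((num_items - 1) // tries, 1)
            return neg, math.log(rank_estimate)  # gradient weight
    return None, 0.0  # no violator found; skip this positive
```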
Thanks a ton!
OK, some updates on this:
I realized that you can pass sample weights along with the interactions data (effectively weighting the sampled triplets, since the weights for the zero-valued negative samples are just ignored, for obvious reasons).
To some extent, this lets the model learn that the preference between a specific item (say, Movie A in the previous example) and the unwatched ones matters more than the preference between Movies B/C and the unwatched ones; which is nice, since it is a proxy for interaction/preference strength.
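For example, reusing the toy matrices from above and passing the play counts as `sample_weight` (that parameter does exist on `LightFM.fit`; using the raw counts as weights is just one possible choice):

```python
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# Toy matrices from the earlier snippet: 1 user, 4 movies (A..D).
rows, cols, counts = np.array([0, 0, 0]), np.array([0, 1, 2]), np.array([8, 2, 2])
interactions = coo_matrix((np.ones_like(counts), (rows, cols)), shape=(1, 4))
play_counts = coo_matrix((counts.astype(np.float32), (rows, cols)), shape=(1, 4))

model = LightFM(loss='bpr')

# sample_weight must be a COO matrix matching the non-zero pattern of
# `interactions`; here the A-vs-unwatched triplets get 4x the weight
# of the B/C-vs-unwatched ones.
model.fit(interactions, sample_weight=play_counts, epochs=10)
```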
I would say that in a large dataset, this instance weighting should be enough to learn the relative preferences amongst items for a specific user. However, explicit training on preferences between non-zero items is still valuable information for the model (especially in mid-sized datasets).
This last point is only a hypothesis on my part; it cannot be demonstrated analytically and would need experimentation and testing. That is why I think this feature could be valuable.
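To make the proposal concrete, here is a hypothetical sketch of the extra triplet generation I have in mind (a made-up function, not existing LightFM functionality):

```python
from itertools import combinations

def count_based_triplets(user, item_counts):
    """Generate the additional (user, preferred, less_preferred)
    triplets from play counts: one triplet per pair of watched
    items with *different* counts. Equal counts (B vs. C) yield
    no triplet, since no preference can be inferred."""
    for (i, ci), (j, cj) in combinations(item_counts.items(), 2):
        if ci > cj:
            yield (user, i, j)
        elif cj > ci:
            yield (user, j, i)

# Toy example: counts for A(0), B(1), C(2); D is unwatched and is
# handled by the existing negative sampling.
extra = list(count_based_triplets(0, {0: 8, 1: 2, 2: 2}))
# -> [(0, 0, 1), (0, 0, 2)]  i.e. A > B and A > C; nothing for B vs. C
```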
Thank you!