Bug: kNN computes predictions on the k closest among whatever neighbours it finds, instead of a fixed set of k neighbours
The current approach in KNNBasic finds the closest k neighbours among whoever is available and computes the prediction from them. Although this helps to get at least some kind of prediction for each item, it actually compromises the basic idea of comparability of ratings across items: with the current implementation, one item can get a higher prediction than another, and we might recommend it, while its prediction may have been derived from some remote neighbours and is therefore unreliable. So, for K-means - we need to find the k nearest neighbours first and then consider only their ratings to compute predictions.
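To illustrate the difference, here is a minimal sketch of the two strategies (illustrative names and data structures, not the actual surprise internals):

```python
import heapq

# sim[u][v]: similarity between users u and v;
# ratings[v]: dict mapping item ids to user v's ratings.

def neighbors_current(u, i, k, sim, ratings):
    """Current behaviour: the k closest users among those who rated i."""
    raters = (v for v in ratings if v != u and i in ratings[v])
    return heapq.nlargest(k, raters, key=lambda v: sim[u][v])

def neighbors_fixed(u, i, k, sim, ratings):
    """Proposed behaviour: fix the k closest users once, then keep only
    those who rated i (possibly fewer than k, or even none)."""
    nearest = heapq.nlargest(k, (v for v in ratings if v != u),
                             key=lambda v: sim[u][v])
    return [v for v in nearest if i in ratings[v]]
```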
BTW, it makes sense to store the neighbours within the fitted model, to avoid repeating the costly lookup of the closest neighbours. And to make the fitted model more compact, we can get rid of the rest of the similarity matrix after fitting the model, as we don't actually need it for predictions.
> With the current implementation, one item can get a higher prediction than another, and we might recommend it, while its prediction may have been derived from some remote neighbours and is therefore unreliable.
This is not really a bug, it's just the way k-NN is defined. Using a neighborhood that's actually far away is a well-known drawback of the neighborhood methods. Also note that Surprise does exactly what it advertises: it is clear from the docs that only the neighbors having rated the target item are used (for user-user similarities).
I'm not sure why you're mentioning k-means here?
> BTW, it makes sense to store the neighbours within the fitted model, to avoid repeating the costly lookup of the closest neighbours. And to make the fitted model more compact, we can get rid of the rest of the similarity matrix after fitting the model, as we don't actually need it for predictions.
Storing the neighbors during training requires extra work during the prediction step: we can only consider neighbors that have rated the target item, so we'll need to filter out those who did not. Also, storing a U x I matrix may or may not be more memory efficient than storing a U x U or I x I matrix, depending on the number of users and items.
In the current implementation, finding the k-nearest neighbors is fairly efficient because the sorting is done only over the other users that have rated the target item (in the case of user-user similarities). So we're sorting a fairly short list each time instead of sorting a huge list once and for all.
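To make the trade-off concrete, here is a rough sketch of what the prediction step would look like with neighbours stored at fit time (purely illustrative code, not Surprise's actual internals):

```python
def estimate_from_stored(u, i, stored_neighbors, sim, ratings):
    """stored_neighbors[u]: u's k nearest users, found once at fit time.
    We still have to filter out, per prediction, those who did not rate i
    (positive similarities are assumed here)."""
    usable = [v for v in stored_neighbors[u] if i in ratings[v]]
    if not usable:
        raise ValueError("prediction impossible: no stored neighbor rated i")
    num = sum(sim[u][v] * ratings[v][i] for v in usable)
    den = sum(sim[u][v] for v in usable)
    return num / den
```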
If you want to implement the method you suggest I'd be happy to see the results!
Nicolas
> This is not really a bug, it's just the way k-NN is defined. Using a neighborhood that's actually far away is a well-known drawback of the neighborhood methods.
As far as I understand the kNN idea, it is not defined this way. Rather, we should always use the same fixed neighbours and recommend only those items that were rated by some of these fixed neighbours. Thus, we can't always compute a prediction for any given item, which is a well-known drawback of kNN. I formed my view from "Recommender Systems" by Dietmar Jannach, Markus Zanker and Alexander Felfernig, published by Cambridge University Press, 2010 (not sure whether it has a direct formula, but it becomes clear from the text), and a course: https://www.coursera.org/learn/collaborative-filtering, where, besides the lectures covering the topic, a kNN example calculation is given in Excel within a second-week exercise, with examples of the correct results. I decided to reproduce these computations with the surprise lib instead of Excel, and this is how I found that surprise returns incorrect results (they don't match the examples from the exercise).
Finally, there is a common-sense argument. Consider an example of how, if we choose the closest k neighbours among those who have rated some item, the top-N recommendations list fails:
- Imagine item A with the highest marks given by fairly remote k neighbours (having correlation 0.000001, as the library currently only checks that sim > 0). Its weighted mean mark will be 5.
- Item B is rated slightly worse by our k nearest neighbours (having correlation = 1), with a weighted mean mark of 4.95.
Within "top-N" recommendations sorted by rating, this approach would place item A, recommended by neighbours with an average similarity of 0.000001, above item B. Doesn't this seem strange?
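A quick numeric check of this toy example (plain Python; three neighbours with made-up similarities and ratings):

```python
def weighted_mean(sims, marks):
    return sum(s * m for s, m in zip(sims, marks)) / sum(sims)

# item A: rated 5 by three very remote neighbours (sim barely above 0)
print(weighted_mean([1e-6, 1e-6, 1e-6], [5, 5, 5]))    # -> 5.0
# item B: rated slightly worse by the three true nearest neighbours
print(weighted_mean([1.0, 1.0, 1.0], [5, 5, 4.85]))    # -> 4.95
# sorted by predicted rating, A lands above B in the top-N list
```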
A major issue with your approach is that, as it's fairly unlikely that the k nearest neighbors of a user have rated the target item, prediction is most of the time impossible, and in the rare cases where it is possible, it relies on only a few ratings (much fewer than k). So it's not reliable either. Obviously these are just conjectures of mine; it should be empirically verified whether that holds.
I think we can agree that both approaches have pros and cons.
The one that is currently implemented is, as far as I know, the most commonly used. See e.g. The recommender system handbook by Ricci et al. or The recommender system textbook by Aggarwal (I find the latter to be much more useful).
Nicolas, what I found in the book you suggested seems to support the vision of kNN which I described: The recommender system textbook by Aggarwal, p. 9:
" In general, the k most similar users to Bob can be used to make rating predictions for Bob. " ... For example, it might be difficult to find sufficiently similar users to Bob, who have rated Gladiator. In such cases, it is difficult to robustly predict Bob’s rating of Gladiator. In other words, such methods might lack full coverage of rating predictions. Nevertheless, the lack of coverage is often not an issue, when only the top-k items are required.
Well TBH these statements are true for both strategies.
I agree though that it's not so clear what strategy the author is referring to. E.g. p. 36, eq. 2.4:
> Let P_u(j) be the set of k closest users to target user u, who have specified ratings for item j.
I always thought that this means we should first filter out the users who haven't rated j, and then see which ones are the closest. But thanks to your remarks I understand now that it's not so clear. It could mean either your definition, or the one I've been using.
The same goes for the handbook:
> The k-nearest-neighbors (k-NN) of u, denoted by N(u), are the k users v with the highest similarity w_uv to u. However, only the users who have rated item i can be used in the prediction of r_ui, and we instead consider the k users most similar to u that have rated i. We write this set of neighbors as N_i(u).
It's not clear whether N_i(u) is a subset of N(u) (your understanding) or not (mine).
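To spell out the ambiguity (my own notation, reusing the symbols from the quote above):

$$\text{your reading:}\qquad N_i(u) = \{\, v \in N(u) \mid r_{vi} \text{ exists} \,\} \subseteq N(u)$$

$$\text{my reading:}\qquad N_i(u) = \text{the } k \text{ users with highest } w_{uv} \text{ among } \{\, v \mid r_{vi} \text{ exists} \,\}$$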
In any case, if you want to implement your solution please go for it!
You mean I may replace the current .predict() logic for all kNN algorithms? Do you confirm that I can also adjust the .fit() logic to store only the top-k neighbours and their respective similarities (instead of keeping the full sim matrix)? I believe it makes sense to keep the full similarity matrix computation, but make it optional, only for those who want to double-check the similarity matrix computation (e.g. I compared it to the one obtained from pandas).
Refactoring plan
Aim: minimal changes to the current code. Optional access to the current .sim matrix.
- [ ] Move "Compute similarity" from AlgoBase into SymmetricAlg (as it belongs to kNN only)
- [ ] Insert compute_kNN at the end of every fit method
- [ ] make the .sim matrix optional (by default - kill it after kNN is computed)
- [ ] change .prediction() for all kNN
- [ ] add some tests
- [ ] Optionally - add top_n_recommendation(uid)
> You mean I may replace the current .predict() logic for all kNN algorithms?
No, I think both options should be available, and the current one (even if it may not be the most natural for some) should be the default, to preserve backward compatibility. This could be set via a parameter neighbors_set whose values could be either extended (default) or conservative (the new version). This is just a suggestion of course, but the possible values should be as self-explanatory as possible. Also, predict() should not be changed: only fit() and estimate().
> Do you confirm that I can also adjust the .fit() logic to store only the top-k neighbours and their respective similarities (instead of keeping the full sim matrix)?
As I said earlier I'm not sure this would improve performance. If it does improve performance in some cases (whether it's computation time or memory use), then I'd be happy to support it. But it should be done in a separate PR as this is for the most part unrelated to the neighbors selection.
Move "Compute similarity" from AlgoBase into SymmetricAlg (as it belongs to kNN only)
Good idea
> Insert compute_kNN at the end of every fit method
I'm not sure what you mean?
> make the .sim matrix optional (by default - kill it after kNN is computed)
Ok if that's useful, but in a different PR (see remark above)
> change .prediction() for all kNN
There is no prediction() method, did you mean estimate()?
> add some tests
YES :)
> Optionally - add top_n_recommendation(uid)
There is already a guide for this in the FAQ if I'm not mistaken.
Good luck in any case, and thank you for addressing this :) ! If you haven't already, please check out the contribution guidelines. If they're not clear enough, let me know and I'll try to improve them.
Quick update: I checked mymedialite and librec versions of kNN. If I understand their code correctly, they both compute the set of nearest neighbors and only then filter out those not having rated the target item, which supports your version :)
So, since other libraries, as you have checked, implement it differently - I propose to change the default behaviour of the current kNN implementation, rather than introduce the new behaviour as just an option. The old estimate() behaviour, I believe, should be abolished as inconsistent with other implementations, while the optional full sim matrix might still be available after fitting the kNN, to make it easier to verify the correctness of the similarity measure computations.
Like I said, I prefer to keep the current version as the default one for backward compatibility reasons. If we change the default algorithm, then some people will find that their predictions have changed when upgrading to the new version, even if they haven't changed their code.
I believe that being consistent with the other (standard) way of implementing kNN is more important than being consistent with previous versions. I don't see this change in predictions as a bad thing in itself, if it provides better top-N recommendations. And I am concerned that if we keep the current behaviour as the default one, new users will get these inconsistent predictions, as in many cases people just stay with the default options. Besides, I see value in keeping the .sim property for analysis, but I don't see any value in keeping the opportunity to get current-style predictions at all, as they potentially provide wrong top-N recommendations.
I feel strongly about the need to change the behaviour. And in case you are concerned about informing users about it - you may, for instance, just bump the major library version (2.x), as one should usually expect some change in behaviour in a new major version (if we add top_N prediction to the API, such a version change will be even more justifiable, as it signals the opportunity to use a built-in method rather than implementing it through some adapter over the library).
I understand your point. Changing the API and default behaviour in the next major release might be the solution. But for now, let's just keep the current strategy as the default one: we can change the default later.
I consider leaving the default behaviour as is - a bug proliferation. It means that all new users will get the wrong default behaviour, and old users, by default, will also get deteriorated top-N predictions. So, I won't agree to implement it this way (leaving the current behaviour as the default). I also want to find a way to address your concern about informing users about the change of behaviour. So, let's find some way to achieve both goals simultaneously: either immediately releasing a new major version, or some other way (like updating the lib description about the change of behaviour, or adding some additional note to prediction results).
> I consider leaving the default behaviour as is - a bug proliferation. [...] all new users will get the wrong default behaviour
Once again, the current implementation is neither wrong nor correct. It's not a bug. It's just the way it is, and it does exactly what it claims it does. It's a variant of the neighborhood approach, just like the one you propose is another variant (and there are many, many variants). They both have pros and cons, and we may prefer one or the other depending on our goals. If that's of any comfort to you, Crab proposes both strategies for selecting the neighbors.
> old users, by default, will also get deteriorated top-N predictions
That's still to be (empirically) proven. The version that ends up being the default will be the one that performs best (criteria being accuracy and top-N metrics, computation time, and memory usage).
If the new method ends up being the default one, this will happen on a new major release (note that this is still hypothetical).
Also, we won't release version 2.0 for this change alone. Version 2.0 will be out once I finish working on implicit feedback (I can't give any timeline right now). In the meantime, nothing prevents us from releasing 1.0.6 with your changes, keeping the current neighborhood as the default one. That would give us the opportunity to add a warning in the doc indicating that the default will probably change in the next releases.
You are right - both strategies for defining the k neighbours are used. But each of them is commonly associated with a particular method:
- non-fixed k (as you have implemented it) is used for item-item kNN.
- fixed k is used for user_based kNN.
I propose to implement such default behaviour without any params, to avoid cluttering the API, i.e. leave the current behaviour as is for user_based=False, and change it for user_based=True.
Have a look at the Args docstring for kNN that we will otherwise have to add. I would prefer to avoid it:
fix_k_neighbors(boolean): Defines whether k neighbours used to
predict similarity are fixed, i.e. same neighbours are used for any
item, or k neighbours are obtained for each item as the closest
neighbours among those who actually rated this item (for user_based
similarity) or among all similar items (for item-item similarity).
Default = False
- For item-item kNN - usual strategy is to obtain k
closest items, among rated by current user (use default)
- For user_based kNN - to obtain some kind of prediction for
a greater number of items use non-fixed neighbours (default).
To obtain consistent ranking among top-N predictions - use fixed k.
> But each of them is commonly associated with a particular method [...]
I didn't know that, thanks for the info. Would you have any reference?
> I propose to implement such default behaviour without any params, to avoid cluttering the API, i.e. leave the current behaviour as is for user_based=False, and change it for user_based=True.
If what you're saying above is true then changing the default depending on user_based is indeed a good option (defaults should only change in the next major release though, I will not change my mind on this :) ). I still want to provide users with the possibility to change it, so there should still be a parameter.
Surprise is about letting users have complete control over their experiments, so we need to make everything explicit and tunable. So we should be able to choose whatever neighborhood we want. I don't mind long docstrings as long as they're clear and useful.
As I have just realized, they pick any k only because they have another variable confining the neighborhood in which to look for these k. So, this can't be used even for i-i without adding an additional param, "the model size".
This was in the video lectures of my Coursera course:
> The item-item model is often truncated to only keep M neighbors per item. The scorer uses at most k neighbors to compute each prediction. M must be significantly larger than k, because the user we’re predicting for may not have rated all M of the neighbor items; we need enough potential neighbors to be able to find k of them among the user’s ratings.
So, here "M" is the same as "k" in the current surprise implementation.
Original paper describing item-item CF: Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web (WWW '01). ACM, New York, NY, USA, 285-295. See 4.4:
> A model size of l means that we only considered l best similarity values for model building and later on used k of them for the prediction generation process, where k < l.
So, it seems that we should either compute similarity with the "fixed k" approach, or add a constraint on the model size (a fixed number of neighbors among which to look for the k closest). Would you agree to add such a param right away? It is logical, as it will allow both neighbor-selection behaviours to be used effectively, in both user-based and item-based approaches.
Although I may also add this param in a different PR - it won't result in double work when changing the code.
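To make the paper's scheme concrete, here is a toy sketch of the model-size idea (all names and dict-based structures are made up for illustration):

```python
import heapq

# sim[i][j]: item-item similarity; ratings[u]: dict of items rated by u.

def build_model(sim, model_size):
    """At fit time, keep only the l = model_size best neighbours per item."""
    return {i: heapq.nlargest(model_size, (j for j in sim[i] if j != i),
                              key=lambda j: sim[i][j])
            for i in sim}

def estimate(u, i, k, model, sim, ratings):
    """Use the (at most) k best of i's l stored neighbours that u rated."""
    candidates = [j for j in model[i] if j in ratings[u]]
    neighbors = heapq.nlargest(k, candidates, key=lambda j: sim[i][j])
    den = sum(sim[i][j] for j in neighbors)
    if den == 0:
        return None  # no usable neighbour among the stored ones
    return sum(sim[i][j] * ratings[u][j] for j in neighbors) / den
```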
Sorry Олег, I'm not sure I completely understand everything, but I don't see how this supports your claim that the different strategies are preferred depending on whether we're computing similarities between users or items?
> I don't see how this supports your claim that the different strategies are preferred depending on whether we're computing similarities between users or items?
In short: for item-item recommendations, it is not the same fixed k neighbours that are used, but the k closest items among those already rated by the user.
Ok but it is simply specific to the proposed algorithm in this paper, not a general recommended practice, right?
Anyway, let's keep things simple.
The best course of action for now is to add a neighborhood parameter to all k-NN algorithms, which can be either extended (default, current strategy) or conservative. This parameter indicates how the neighborhood is defined (u and i being the target user and item):
- extended: the set of neighbors is defined as the set of k closest users to u, among those that have rated i.
- conservative: the set of neighbors is defined as the set of k closest users to u, regardless of the existence of their rating for i.
These are the definitions for user-user similarities. The item-item case is analogous.
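For instance, usage could look like this (neighborhood is the hypothetical new parameter; KNNBasic, k and sim_options are the existing API):

```python
from surprise import KNNBasic

# current strategy, would stay the default:
algo = KNNBasic(k=40, sim_options={'user_based': True},
                neighborhood='extended')
# proposed strategy:
algo = KNNBasic(k=40, sim_options={'user_based': True},
                neighborhood='conservative')
```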
If the conservative approach is used, then indeed the similarity matrix could be thrown away once all neighbors are stored. We could allow this option via another parameter, but this should be done in another PR: all in good time.
Nicolas
I would rather keep the parameter as I have specified above (fix_k_closest). The idea of a neighborhood parameter is tricky, because the item-item paper I linked above uses another parameter to confine the neighborhood ("model size"). And I believe that we should pay attention to this paper, as it is the one that originally introduced the notion of item-item kNN.
fix_k_closest is misleading and does not reflect either of the two behaviours.
neighborhood is versatile enough: we can allow different possible values later if we want to add more neighborhood selection strategies, such as the one of the paper for example.
That being said, just because this is the paper introducing item-item doesn't mean we should implement it: it's more than 15 years old and practices evolve a lot.
fix_k_closest - is clear, together with a good param description. Previously you agreed on the example of the description. The bad thing with the "neighborhood" name is that it complicates the later introduction of another "model size"/"neighborhood size" param, which is the recommended contemporary approach to item-item recommendations, according to the 2016 Coursera course.
> fix_k_closest - is clear, together with a good param description
No, it's not. The number of neighbors is not fixed in either of the methods.
> Previously you agreed on the example of the description.
No, I did not. I just said that I didn't mind long docstrings as long as they are clear. I'm sorry, but the one you proposed was not.
Can you please expand a bit on the model size / neighborhood size? I checked the paper briefly but did not completely understand what it means, and I don't have time to read it carefully for now. A quick TLDR would be helpful.
k neighbors are absolutely fixed in the method that I advocate.
The idea of a real-world implementation is to decrease the memory consumption of the trained model and to make predictions faster, by precomputing as much as possible at the training stage.
So, only the k closest neighbors and their respective similarities need to be kept as part of a user-based model (and that is why I advocate the "fix_k_closest" or "fixed_k" param name).
For item-based models, the k neighbors are not fixed, just as in the current implementation. Probably the reason is that its usage is recommended only for situations where there are many more users than items, where users have rated only a small subset of items while every item was rated by many users. In this situation it is proposed to look for the k neighbors of a given item among the items rated by the current user. The optimization approach is to store only a small subset of neighbors for every item (e.g. ~200 items for k ~ 30), where this "model size" >> k. The eventual model is still relatively compact (it is approximately n/model_size smaller than the full similarity matrix, where n is the number of items). This approach allows the method to be used by companies such as Amazon, as it drastically decreases the model size. (Btw, issue #3 reported this very problem - the current model's memory footprint is too big, because it keeps excessive similarities.) So, for item-item CF we may keep the current approach of looking up any neighbors, but in order to make it feasible for production purposes, we should confine the model size by keeping only a subset of the sim matrix, defined by a "model size"/"neighborhood size" param.
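A rough numpy sketch of the compression step I mean (illustrative only; model_size plays the role of l):

```python
import numpy as np

def prune_sim(sim, model_size):
    """Keep only the model_size largest similarities per row of a dense
    item-item matrix; the rest is zeroed, so the result can be stored
    sparsely (roughly n / model_size smaller, n being the item count)."""
    pruned = np.zeros_like(sim)
    for i in range(sim.shape[0]):
        row = sim[i].copy()
        row[i] = -np.inf  # ignore self-similarity
        top = np.argpartition(row, -model_size)[-model_size:]
        pruned[i, top] = sim[i, top]
    return pruned
```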
Nicolas, the fixed-k approach also makes it possible to compute top-N predictions much more efficiently, by looking only at the items that at least a minimum number of the fixed neighbors have jointly rated. So, I propose to introduce a get_top_n method for the kNN algs, available when the new option is set to True. It will have drastically better performance than the one you propose in the FAQ (which computes predictions for all items).
> k neighbors are absolutely fixed in the method that I advocate.
I understand what you mean: they're fixed in the sense that for a given user, we will always look at the same set of neighbors, which is not the case in the current implementation, where we first filter out users that have not rated the target item.
However, with the method you propose, the actual set of neighbors that we use for a prediction is not fixed, because the subset of neighbors that have rated the target item is never the same. Let N(u) be the k nearest neighbors of u, regardless of their rating for the target item. With the new method, we use a subset of N(u) to make a prediction: only the users in N(u) that have rated i can be used. For a different item j, this subset of neighbors will be different.
This is why I think fix_k_closest or fix_k_neighbors or fixed_k is misleading.
Thanks for the TLDR.
Here is what I understand, please let me know if I got it right or wrong (I consider here for simplicity that we have a user-user similarity):
- The model size (l) is the number of stored neighbors for each user. Let N(u) denote this set for each user; |N(u)| = l for all users.
- To make a prediction for user u, item i: we look at the k < l users in N(u) that have rated i. Obviously we may find fewer than k users in N(u) that have rated the target item: this number is k' and is the actual number of neighbors that are used for the prediction.
- This view is a generalization of the current implementation and the proposed change:
- current technique is equivalent to having l = |U| (where U is the set of all users)
- proposed change is equivalent to having l = k << |U|.
Is that correct, or am I missing something?
Note that issue #3 was about similarity computation using too much memory. This will not be fixed by the proposed changes: the similarity matrix has to be computed anyway, even if it's ultimately freed.
I'm not sure I completely understand what you mean about top-N but honestly I'd rather not focus on that right now as it's fairly unrelated to the current matter.
Nicolas
You got it right with model size, and wrong with fixed k:
- The fixed-k approach is generally used only for u-u CF, and for it the model size parameter (l) doesn't make sense at all: it's enough to store only the k fixed nearest neighbors. For any item, the prediction is made as a weighted average over those fixed k neighbors, but at least a minimum number of them must have rated it. So, here the "fixed_k" param is applicable, and it correctly conveys the point.
- For item-item CF, the approach to finding neighbors remains the same as is currently implemented, i.e. the k neighbor items are used. But to compress the model, the approach is to store only l neighbors per item.
To conclude - we need a "fix_k_closest" param to implement the classical u-u approach, and a "model_size" param for a more efficient i-i implementation.
Regarding top-N prediction - you proposed to do it by predicting a score for each item and then sorting out the highest predictions. But with user-based fixed-k kNN we actually won't be able to predict a rating for a large share of the items. It would be computationally cheaper to first select only the subset of items rated by at least a minimum number of the fixed k nearest neighbours, and then make predictions only for those items. So, for user-user CF you don't need to try to predict all ratings, and the approach will be more computationally efficient.
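A sketch of the cheaper top-N loop I have in mind (illustrative names; min_support is a hypothetical threshold, and positive similarities are assumed):

```python
import heapq
from collections import Counter

def top_n(u, n, fixed_neighbors, sim, ratings, min_support=1):
    """fixed_neighbors[u]: u's k nearest users, precomputed at fit time.
    Only items rated by >= min_support of them are even considered."""
    counts = Counter(i for v in fixed_neighbors[u] for i in ratings[v]
                     if i not in ratings[u])
    candidates = [i for i, c in counts.items() if c >= min_support]

    def predict(i):
        usable = [v for v in fixed_neighbors[u] if i in ratings[v]]
        num = sum(sim[u][v] * ratings[v][i] for v in usable)
        return num / sum(sim[u][v] for v in usable)

    return heapq.nlargest(n, candidates, key=predict)
```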
See my PR; it will probably be easier to get an idea of the user-based fixed-k approach from the code.