Applicability of HNSW for maximum inner product search
When specifying the option space='ip', is the same algorithm used as for the 'l2' distance? Or is there a reduction to a metric space before HNSW is applied?
Thanks in advance, Ori
No reduction is necessary. In fact, it can be detrimental: Morozov, Stanislav, and Artem Babenko. "Non-metric similarity graphs for maximum inner product search." Advances in Neural Information Processing Systems. 2018.
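(For reference, a minimal sketch of what space='ip' looks like in the Python bindings: the vectors are indexed as-is, and only the distance function changes. Parameter values below are illustrative.)

```python
import hnswlib
import numpy as np

dim, num_elements = 64, 10_000
data = np.float32(np.random.random((num_elements, dim)))

# space='ip' runs the same HNSW construction and search as space='l2';
# only the distance function differs, and the vectors are stored as-is.
index = hnswlib.Index(space='ip', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data)

index.set_ef(50)  # query-time speed/recall trade-off
labels, distances = index.knn_query(data[:5], k=10)
# For space='ip' the returned "distances" are 1 - <q, x>.
```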
Thank you for the quick response! I am aware of this paper, and I actually asked this question because I saw that Morozov et al. compared their algorithm with a baseline called "NSW+reduction". In your experience, is this reduction redundant/inadequate, or might it yield better performance? Thanks again.
@OriKatz I have actually run my own experiments (on very different data) with the same reduction, and it didn't work out well for me either. Bugs are always possible, though.
@OriKatz Note that it is possible to improve the speed of inner product search by removing from the dataset the elements that are not the closest to themselves under the inner product.
@yurymalkov what if you have a lot of such points?
@searchivarius I think that means big speedup. Those elements are useless.
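(A brute-force sketch of the suggested pruning, assuming the dataset fits in memory; the function name is mine, and for large datasets the pairwise products should be computed in blocks.)

```python
import numpy as np

def keep_self_closest(data: np.ndarray) -> np.ndarray:
    """Keep only the elements that are the closest to themselves under
    the inner product, i.e. <x, x> >= <x, y> for every y in the dataset."""
    self_sim = np.einsum('ij,ij->i', data, data)  # <x, x> for each row
    max_sim = (data @ data.T).max(axis=1)         # max_y <x, y> for each row
    return data[self_sim >= max_sim]
```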
Thanks a lot for the comments(:
@yurymalkov Regarding your suggestion: the dataset I'm considering originates from a collaborative filtering model (user and item embeddings), and some of the elements (popular items) may be highly popular. Therefore I'm concerned that most of the elements are not most similar to themselves. Have you ever tried applying HNSW with inner-product similarity to this kind of data?
@OriKatz I would think that graph-based retrieval is a very stable approach in general. It often works very well on weird datasets. However, for the popular-element problem there might be another speed-up solution, though it requires some changes to the code.
@OriKatz I've applied it to simulated data. It worked reasonably well, although the result depended on the order of the data. @searchivarius I think the best solution is just to remove useless elements. If there is a good query source, it can be used to remove additional elements. I think placing popular elements at top layers was discussed during the work on the NeurIPS paper, but probably it didn't help that much.
@yurymalkov I don't suggest placing popular elements in top layers :-)
@yurymalkov What do you mean by "depended on the order"? As far as I understand, popular items may become hubs, which can cause problems when navigating, so I assume that inserting the elements in inverse order of popularity is better. On the other hand, non-popular items might exhibit very noisy and unstructured embeddings, so I'm not sure this is a good methodology for handling this kind of problem. Do you agree with these conjectures, or is my intuition/understanding wrong?
@searchivarius What kind of code modification do you think could help when elements differ widely in popularity?
@OriKatz I've had some success building a graph using a slightly different metric than the original one. I used it in my thesis and in a follow-up publication. If the indexing metric is cosine similarity, or some linear combination of cosine and inner product, this may mitigate the problem. Whether it helps or not is an empirical question, though. However, you need to use this combination at index time; retrieval likely needs to be guided by the regular inner product.
PS: and of course there's always the option to index popular items separately in a second index.
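(A hedged sketch of the index-time combination: a fixed mix of inner product and cosine can be folded into a per-element rescaling of the data, so the graph can be built on the rescaled vectors with a plain 'ip' space. The beta parametrization is mine, not from the cited thesis, and, as noted above, guiding the search by the regular inner product afterwards still requires changes to the library code.)

```python
import numpy as np

def blend_ip_cosine(data: np.ndarray, beta: float = 0.5) -> np.ndarray:
    """Rescale vectors so that <q, x'> = beta * <q, x> + (1 - beta) * ||q|| * cos(q, x);
    since ||q|| is constant within a query, this mixes inner product and cosine.
    beta=1 keeps the raw inner product, beta=0 gives a pure cosine ranking."""
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    return data * (beta + (1.0 - beta) / norms)
```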
@OriKatz In HNSW the out-degree on each level is fixed, so the search does not see any hubs. There are in-bound hubs, which make non-popular items harder to reach, but I am not sure this is a problem, as non-popular items might never end up as an answer even in exact search. The worst thing that can happen in general is clusters, which require special parameters in HNSW, but this is orthogonal to inner product. @searchivarius Ok. Yes, I remember that :)
@yurymalkov ML fairness people would eat you alive 😃
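(For context, the "special parameters" for clustered data mentioned above usually amount to building with a larger M and ef_construction; the values below are illustrative, not tuned recommendations.)

```python
import hnswlib

# Illustrative only: for strongly clustered data, a larger out-degree M and a
# larger ef_construction typically improve recall at extra build time/memory.
index = hnswlib.Index(space='ip', dim=64)
index.init_index(max_elements=100_000, ef_construction=400, M=48)
```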