hnswlib icon indicating copy to clipboard operation
hnswlib copied to clipboard

Applicability of HNSW for approximate sorting

Open jwimberley opened this issue 4 years ago • 7 comments

A general question rather than an issue, really ---

I've been experimenting with the use of HNSW for a (very slightly) different problem than k-nearest neighbors, and wanted to ask whether it's really within HNSW's comfort zone, so to speak. The task is, given a dataset of N points, to be able to traverse the dataset in approximate nearest-order from many arbitrary starting points in the space. Thus, "k" is not really predetermined, and could be anywhere between 1 and N, depending on whether the search finds what it wants before it exhausts the dataset -- which I expect (in my use case) to be quite frequent.

As a first step, I've adapted the Hierarchical::searchKnn method to return an approximately sorted vector of all points in the dataset, in a method approxSortKnn in this forked branch:

https://github.com/jwimberley/hnswlib/tree/approx_graph_sorting

Essentially the brute force modification overloads searchBaseLayerST with a similar method that bounds the size of the priority_queue of top candidates by Ef, and pops out the excess elements (ordered by negative distance) into the quasi-sorted return vector. Based on some tests this appears to work, but certainly might be flawed due to misunderstandings of HNSW on my part. A more complete implementation would be a visitor method that would apply a user supplied functor to each candidate in the search and send a termination signal.

Is this a valid task that HNSW could be used for? Or do you think other approaches would more suited to traversing nearly the full dataset from nearest-to-farthest neighbors?

The forked branch I linked to above includes some trivial tests running and timing this approximate sort. I note in the commit logs that this method is currently slower than fully sorting the dataset using std::sort, but I don't think that's very fair comparison just yet since

  • my brute force modification couple probably be made more efficient
  • having to traverse all N points in the dataset is the worst-case, so these timings don't model the gain when actually the search would terminate at k << N
  • my use case's datasets are 2 or 3 dimensional, and on the smaller side (10k to 100k), which are perhaps not large enough for HNSW's advantages to be realized?

jwimberley avatar Jul 12 '20 22:07 jwimberley

What's exactly sorting here?

searchivarius avatar Jul 12 '20 23:07 searchivarius

Sorry, I was imprecise -- I used "sorting" above to mean "traversing a fixed set of N points in d-dimensional space in approximately increasing order of Euclidean distance from an arbitrary point in d-dimensional space." In my current experimental branch, the "sort" method just returns an std::vector of length N containing the indices of the dataset points, for the client to traverse through, but this is obviously wasteful if the client finds a matching result early in the traversal order.

jwimberley avatar Jul 13 '20 02:07 jwimberley

@jwimberley Well, it seem to make sense to me. Can you give the context? What is the application?

yurymalkov avatar Jul 13 '20 06:07 yurymalkov

@yurymalkov I can give a general context. The problem is to search the dataset for a point x that is both close to a test point T (the closer the better, without having to be exact) and satisfies some arbitrary boolean condition S(x,t). For example, this could be just a generic search routine to find a nearby neighbor satisfying some extra property; or a heuristic for a min-cost assignment problem, where there's a bipartite graph with left nodes T and right nodes X, and an edge from t to x either does not exist or has a cost related to the distance between t and x, with the nodes x perhaps being capacitized.

@searchivarius Thanks for the suggestion! If I'm reading it correctly, this would amount to running HierarchicalNSW::searchKnn with k=N, or are you suggesting something different? The main reason why didn't think this was workable, originally, is that I thought that the final step of inserting all N elements into a priority queue (and thus sorting them completely) would make the method no better than just sorting them from the get-go. (That was the motivation behind my implementation I linked to, where I change the return type from a priority queue to a vector). However, I think now I might have been mistaken about that. My understanding of the implementation is that elements are inserted into top_candidates here

https://github.com/nmslib/hnswlib/blob/3c6a84fb54a2a92dc4b7dc3e13dbae3d99892444/hnswlib/hnswalg.h#L318-L319

in some close-to-sorted order, because the candidates are coming from the underlying graph traversal, and contrary to what I thought originally this is more efficient that inserting elements into std::priority_queue in a random order.

jwimberley avatar Jul 13 '20 14:07 jwimberley

@jwimberley If you need to do two things - do knn search and statisfy a boolean condition, why not to apply the condition to the existing element deletion logic? E.g. you can treat the elements that do not satisfy the condition as deleted. I think this can be checked by overriding the isMarkedDeleted method.

yurymalkov avatar Jul 14 '20 02:07 yurymalkov

The boolean condition varies on each query, so the overhead of deleting and undeleting elements before each query wouldn't be worth it, I think.

Based on some of the dialogue here, I'm going to try two directions:

  • try to complete my branch with an "dynamic k" NN search where a client supplied functor returns a boolean indicating whether it needs to see more neighbors, which are passed to it from an intermediate fixed-size priority queue
  • taking a suggestion from @searchivarius for a heuristic: doing 1-nn neighbor and then looking up a cached permutation of the N points ordered by distance from the nearest neighbor; this won't be as accurate in terms of the order of neighbors but will be pretty high performant, with O(N^2) memory for the cached permutation and just O(log(N)) to pick the permutation, no matter how many entries k on the permutation the client actually looks at.

Thank you both for your suggestions!

jwimberley avatar Jul 14 '20 13:07 jwimberley

@jwimberley Sure. Concerning the deletions - you do not need to "delete" the element, you can just change the method isMarkedDeleted to your boolean method.

yurymalkov avatar Jul 14 '20 16:07 yurymalkov