hnswlib
Applicability of HNSW for approximate sorting
A general question rather than an issue, really ---
I've been experimenting with the use of HNSW for a (very slightly) different problem than k-nearest neighbors, and wanted to ask whether it's really within HNSW's comfort zone, so to speak. The task is, given a dataset of N points, to be able to traverse the dataset in approximate nearest-order from many arbitrary starting points in the space. Thus, "k" is not really predetermined, and could be anywhere between 1 and N, depending on whether the search finds what it wants before it exhausts the dataset -- which I expect (in my use case) to be quite frequent.
As a first step, I've adapted the HierarchicalNSW::searchKnn method to return an approximately sorted vector of all points in the dataset, in a method approxSortKnn in this forked branch:
https://github.com/jwimberley/hnswlib/tree/approx_graph_sorting
Essentially the brute force modification overloads searchBaseLayerST with a similar method that bounds the size of the priority_queue of top candidates by Ef, and pops out the excess elements (ordered by negative distance) into the quasi-sorted return vector. Based on some tests this appears to work, but certainly might be flawed due to misunderstandings of HNSW on my part. A more complete implementation would be a visitor method that would apply a user supplied functor to each candidate in the search and send a termination signal.
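The bounding idea can be sketched roughly as follows (a standalone sketch, not the actual fork: the names Candidate, ef, and quasiSortCandidates are illustrative, and the candidate stream stands in for the graph traversal):

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

// Sketch: as candidates arrive from the traversal, keep a max-heap on
// distance bounded at size ef; each overflow evicts the current farthest
// candidate into the output. The result is only quasi-sorted: each evicted
// element was the farthest of the <= ef candidates held at that moment.
using Candidate = std::pair<float, int>;  // (distance, label)

std::vector<int> quasiSortCandidates(const std::vector<Candidate>& stream,
                                     std::size_t ef) {
    std::priority_queue<Candidate> topCandidates;  // max-heap on distance
    std::vector<int> order;
    order.reserve(stream.size());
    for (const Candidate& c : stream) {
        topCandidates.push(c);
        if (topCandidates.size() > ef) {
            order.push_back(topCandidates.top().second);  // evict farthest
            topCandidates.pop();
        }
    }
    // Drain the remaining <= ef candidates (also farthest-first) ...
    while (!topCandidates.empty()) {
        order.push_back(topCandidates.top().second);
        topCandidates.pop();
    }
    // ... then reverse so the client reads the order nearest-first.
    std::reverse(order.begin(), order.end());
    return order;
}
```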
Is this a valid task that HNSW could be used for? Or do you think other approaches would be better suited to traversing nearly the full dataset from nearest to farthest neighbors?
The forked branch I linked to above includes some trivial tests running and timing this approximate sort. I note in the commit logs that this method is currently slower than fully sorting the dataset using std::sort, but I don't think that's a very fair comparison just yet, since
- my brute force modification could probably be made more efficient
- having to traverse all N points in the dataset is the worst-case, so these timings don't model the gain when actually the search would terminate at k << N
- my use case's datasets are 2 or 3 dimensional, and on the smaller side (10k to 100k), which are perhaps not large enough for HNSW's advantages to be realized?
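For reference, the exact baseline being compared against is just a full std::sort of indices by distance; a minimal sketch (the flat point layout and function name are assumptions for illustration):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Exact baseline: sort all N indices by Euclidean distance from an arbitrary
// query point, O(N log N) comparisons plus N distance evaluations.
// Points are stored flat: point i occupies points[i*dim .. i*dim + dim - 1].
std::vector<int> exactSortByDistance(const std::vector<float>& points,
                                     const std::vector<float>& query,
                                     std::size_t dim) {
    const std::size_t n = points.size() / dim;
    std::vector<float> dist(n);
    for (std::size_t i = 0; i < n; ++i) {
        float d = 0.f;
        for (std::size_t j = 0; j < dim; ++j) {
            float diff = points[i * dim + j] - query[j];
            d += diff * diff;
        }
        dist[i] = d;  // squared distance preserves the ordering
    }
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return dist[a] < dist[b]; });
    return order;
}
```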
What exactly is "sorting" here?
Sorry, I was imprecise -- I used "sorting" above to mean "traversing a fixed set of N points in d-dimensional space in approximately increasing order of Euclidean distance from an arbitrary point in d-dimensional space." In my current experimental branch, the "sort" method just returns an std::vector of length N containing the indices of the dataset points, for the client to traverse through, but this is obviously wasteful if the client finds a matching result early in the traversal order.
@jwimberley Well, it seems to make sense to me. Can you give the context? What is the application?
@yurymalkov I can give a general context. The problem is to search the dataset for a point x that is both close to a test point T (the closer the better, without having to be exact) and satisfies some arbitrary boolean condition S(x,t). For example, this could be just a generic search routine to find a nearby neighbor satisfying some extra property; or a heuristic for a min-cost assignment problem, where there's a bipartite graph with left nodes T and right nodes X, and an edge from t to x either does not exist or has a cost related to the distance between t and x, with the nodes x perhaps being capacitated.
@searchivarius Thanks for the suggestion! If I'm reading it correctly, this would amount to running HierarchicalNSW::searchKnn with k=N, or are you suggesting something different? The main reason why I didn't think this was workable, originally, is that I thought the final step of inserting all N elements into a priority queue (and thus sorting them completely) would make the method no better than just sorting them from the get-go. (That was the motivation behind the implementation I linked to, where I changed the return type from a priority queue to a vector.) However, I now think I might have been mistaken about that. My understanding of the implementation is that elements are inserted into top_candidates here
https://github.com/nmslib/hnswlib/blob/3c6a84fb54a2a92dc4b7dc3e13dbae3d99892444/hnswlib/hnswalg.h#L318-L319
in some close-to-sorted order, because the candidates are coming from the underlying graph traversal, and contrary to what I thought originally, this is more efficient than inserting elements into std::priority_queue in a random order.
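If searchKnn with k=N is the route, note that the returned priority queue pops farthest-first (it is a max-heap on distance), so producing a nearest-first traversal order is just one linear drain plus a reversal. A small sketch, with float/int standing in for hnswlib's dist_t and labeltype templates:

```cpp
#include <algorithm>
#include <queue>
#include <utility>
#include <vector>

// A searchKnn-style result is a max-heap on distance: top() is the farthest
// of the k results. Draining it yields farthest-to-nearest order, so a
// single reverse produces the nearest-first vector the client wants.
std::vector<int> nearestFirst(std::priority_queue<std::pair<float, int>> result) {
    std::vector<int> order;
    order.reserve(result.size());
    while (!result.empty()) {
        order.push_back(result.top().second);  // farthest remaining
        result.pop();
    }
    std::reverse(order.begin(), order.end());  // now nearest-first
    return order;
}
```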
@jwimberley If you need to do two things - do a knn search and satisfy a boolean condition - why not apply the condition through the existing element deletion logic? E.g. you can treat the elements that do not satisfy the condition as deleted. I think this can be checked by overriding the isMarkedDeleted method.
The boolean condition varies on each query, so the overhead of deleting and undeleting elements before each query wouldn't be worth it, I think.
Based on some of the dialogue here, I'm going to try two directions:
- try to complete my branch with a "dynamic k" NN search, where a client-supplied functor returns a boolean indicating whether it needs to see more neighbors, which are passed to it from an intermediate fixed-size priority queue
- taking a suggestion from @searchivarius for a heuristic: doing a 1-nn search and then looking up a cached permutation of the N points ordered by distance from that nearest neighbor; this won't be as accurate in terms of the order of neighbors, but should be quite performant, with O(N^2) memory for the cached permutations and just O(log(N)) to pick the right one, no matter how many entries k of the permutation the client actually looks at
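The second direction, the cached-permutation heuristic, can be sketched as below (a toy sketch under stated assumptions: 1-D points to keep it short, and brute-force 1-nn where an HNSW search would go; PermutationIndex and its members are hypothetical names):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// For every point i, precompute the permutation of all N indices sorted by
// distance from point i: O(N^2) memory, O(N^2 log N) preprocessing. At query
// time, find the query's nearest neighbor (brute force here; an HNSW 1-nn
// search would make this roughly O(log N)) and return that neighbor's cached
// permutation as the approximate nearest-first traversal order.
struct PermutationIndex {
    std::vector<float> pts;
    std::vector<std::vector<int>> perm;  // perm[i] = indices by distance from pts[i]

    explicit PermutationIndex(std::vector<float> points) : pts(std::move(points)) {
        const std::size_t n = pts.size();
        perm.resize(n);
        for (std::size_t i = 0; i < n; ++i) {
            perm[i].resize(n);
            std::iota(perm[i].begin(), perm[i].end(), 0);
            std::sort(perm[i].begin(), perm[i].end(), [&](int a, int b) {
                return std::abs(pts[a] - pts[i]) < std::abs(pts[b] - pts[i]);
            });
        }
    }

    const std::vector<int>& traversalOrder(float query) const {
        std::size_t best = 0;  // brute-force 1-nn; HNSW would replace this
        for (std::size_t i = 1; i < pts.size(); ++i)
            if (std::abs(pts[i] - query) < std::abs(pts[best] - query))
                best = i;
        return perm[best];
    }
};
```

The returned order is exact only when the query coincides with a dataset point; otherwise it inherits the neighbor's ordering, which is the accuracy trade-off mentioned above.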
Thank you both for your suggestions!
@jwimberley Sure.
Concerning the deletions - you do not need to "delete" the element; you can just change the method isMarkedDeleted to call your boolean method.
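Conceptually, treating elements that fail a per-query predicate as deleted amounts to the filtering below. This sketch applies the predicate to a nearest-first traversal order rather than inside hnswlib itself (isMarkedDeleted is an internal, non-virtual method, so a real integration would mean patching hnswalg.h); firstSatisfying is a hypothetical helper name:

```cpp
#include <functional>
#include <optional>
#include <vector>

// Skip every element failing the per-query condition S, exactly as if it
// had been marked deleted, and return the nearest element that passes.
std::optional<int> firstSatisfying(const std::vector<int>& nearestFirstOrder,
                                   const std::function<bool(int)>& satisfies) {
    for (int label : nearestFirstOrder)
        if (satisfies(label))    // elements failing S act like deleted ones
            return label;        // nearest acceptable neighbor
    return std::nullopt;         // condition never met: dataset exhausted
}
```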