pynndescent
pynndescent copied to clipboard
Nearest neighbors for a subset of points
I'm interested in the following use case: suppose I have a dataset X
with N point in it. I take a subsample Y
with M < N points.
Is it possible to calculate the nearest neighbors in X
of just the points in Y
? And would that be much faster than calculating all nearest neighbors (supposing M << N, say 10%)?
To ensure I am understanding correctly, you want to find, for each y in Y, the points {x_1, ..., x_k} in X that are closest to y? Presuming you have built a search index for X then that is fast, but the building search index often amounts to as much work as finding the nearest neighbours of every point in X. Since Y is a subset of X there are no gains there. On the other hand, with a small enough Y you can brute force the problem, calculating the distances from Y to X for MN total cost; after that it is just sorting which is cheap. Thus if M ~ log(N) this may be cheaper than building the full neighbour sets for all of X (which is, ballpark, about N log(N)). I suspect you actually want something a little smarter leveraging the NNDescent style approach, but I think that is quite hard to achieve since it really leverages the neighbour of a neighbour effect, and thus needs neighbours for all points in X to get real traction.
Possibly I am misunderstanding the question however.
Yup you have the setup exactly right! The intuition about needing the whole graph for a neighbor-of-neighbor method makes sense... I do wonder if there's some pruning approach where you can quickly discard most points in X are nowhere near any points in Y but it definitely seems nontrivial. Thanks!
I agree that a pruning approach may yield some benefits, but as you note, that is going to require some careful analysis as to how best to apply such a strategy. There is also the risk that, even with pruning, you aren't going to beat the MN cost of brute force. Obviously there is a crossover point as M grows, but exactly where that is is non-obvious to me. Regardless, it is worth some thought, but certainly I don't believe there are any simple obvious solutions other than the brute force approach.