pynndescent icon indicating copy to clipboard operation
pynndescent copied to clipboard

Nearest neighbors for a subset of points

Open anashwin opened this issue 3 years ago • 3 comments

I'm interested in the following use case: suppose I have a dataset X with N point in it. I take a subsample Y with M < N points.

Is it possible to calculate the nearest neighbors in X of just the points in Y? And would that be much faster than calculating all nearest neighbors (supposing M << N, say 10%)?

anashwin avatar Oct 22 '21 06:10 anashwin

To ensure I am understanding correctly, you want to find, for each y in Y, the points {x_1, ..., x_k} in X that are closest to y? Presuming you have built a search index for X then that is fast, but the building search index often amounts to as much work as finding the nearest neighbours of every point in X. Since Y is a subset of X there are no gains there. On the other hand, with a small enough Y you can brute force the problem, calculating the distances from Y to X for MN total cost; after that it is just sorting which is cheap. Thus if M ~ log(N) this may be cheaper than building the full neighbour sets for all of X (which is, ballpark, about N log(N)). I suspect you actually want something a little smarter leveraging the NNDescent style approach, but I think that is quite hard to achieve since it really leverages the neighbour of a neighbour effect, and thus needs neighbours for all points in X to get real traction.

Possibly I am misunderstanding the question however.

lmcinnes avatar Oct 22 '21 17:10 lmcinnes

Yup you have the setup exactly right! The intuition about needing the whole graph for a neighbor-of-neighbor method makes sense... I do wonder if there's some pruning approach where you can quickly discard most points in X are nowhere near any points in Y but it definitely seems nontrivial. Thanks!

anashwin avatar Oct 22 '21 18:10 anashwin

I agree that a pruning approach may yield some benefits, but as you note, that is going to require some careful analysis as to how best to apply such a strategy. There is also the risk that, even with pruning, you aren't going to beat the MN cost of brute force. Obviously there is a crossover point as M grows, but exactly where that is is non-obvious to me. Regardless, it is worth some thought, but certainly I don't believe there are any simple obvious solutions other than the brute force approach.

lmcinnes avatar Oct 22 '21 18:10 lmcinnes