Theoretical question about scaling to an arbitrary number of points
I am wondering if the following line of thinking leads anywhere, but please ignore if it doesn't. :)
Let's say we have a very large dataset that will not fit in memory and we want to run umap. The first step normally (unless I'm misunderstanding) would be to create a knn graph. This graph is then used as the "truth" to train the embedding: the final 2-dimensional representation (again, correct me if I'm wrong).
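To make sure I'm describing the same pipeline, here is a rough sketch of the two stages as I understand them (toy data, plain umap-learn / pynndescent usage, nothing out-of-core):

```python
# Rough sketch of the two stages as I understand them (toy data only;
# umap.UMAP normally runs both steps internally).
import numpy as np
import umap
from pynndescent import NNDescent

X = np.random.rand(10_000, 50).astype(np.float32)  # stand-in for the big dataset

# Step 1: approximate knn graph over *all* points -- the part that seems to
# need the whole dataset at once.
index = NNDescent(X, n_neighbors=15, metric="euclidean")
knn_indices, knn_dists = index.neighbor_graph

# Step 2: optimize the 2-d embedding; the knn graph acts as the "truth" the
# layout optimization tries to preserve.
embedding = umap.UMAP(n_neighbors=15, n_components=2).fit_transform(X)
```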
The second part, training the embedding, could be very amenable to a minibatched approach (like parametric umap already does). But for the first step, you still need that full knn graph, and that is the annoying part (right?).
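For concreteness, parametric umap already trains its encoder network in minibatches, but (as far as I can tell) only after the full knn graph has been built, e.g.:

```python
# Minimal parametric umap sketch (needs the tensorflow/keras extra installed).
import numpy as np
from umap.parametric_umap import ParametricUMAP

X = np.random.rand(10_000, 50).astype(np.float32)

# The encoder network is updated in minibatches of edges sampled from the knn
# graph -- but that graph is still built over the full dataset first.
embedder = ParametricUMAP(n_neighbors=15, n_components=2)
embedding = embedder.fit_transform(X)
```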
I was wondering... are there any arguments we could make about knn graphs and downsampled knn graphs that could allow us to not start with the full knn graph, but instead start (perhaps iteratively) with mini knn graphs created from only a subset of the data at a time?
(I have naively tried the following in minibatches -- (1) make a knn graph using only the minibatch, (2) update params of a parametric neural network embedding using the data from the minibatch -- but perhaps unsurprisingly, the results are not as good as real umap on the full dataset.)
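For reference, the naive loop looks roughly like this (illustrative only; `loss_from_graph` and `sgd_step` are hypothetical stand-ins for whatever the parametric encoder actually uses):

```python
# Sketch of the naive minibatch scheme: per batch, (1) local knn graph,
# (2) one update of the parametric embedding network against that graph.
# `loss_from_graph` and `sgd_step` are hypothetical placeholders.
from pynndescent import NNDescent

def naive_minibatch_umap(batches, encoder, loss_from_graph, sgd_step, n_neighbors=15):
    for X_batch in batches:
        # (1) knn graph restricted to the minibatch -- neighbours outside the
        #     batch are invisible, which is presumably why quality suffers.
        index = NNDescent(X_batch, n_neighbors=n_neighbors)
        knn_indices, knn_dists = index.neighbor_graph

        # (2) update the encoder parameters against this purely local graph.
        loss = loss_from_graph(encoder, X_batch, knn_indices, knn_dists)
        sgd_step(encoder, loss)
    return encoder
```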
But recently I've been wondering if there might exist some mathematical relationship between a knn graph of all datapoints and a knn graph of some random subset of those datapoints, which would allow us to turn that naive minibatch-only-knn approach into something that results in a provable approximation of the real umap when you're done.
Wondering if you have any thoughts @lmcinnes ? Thanks for your time. :)
I think there are some possibilities in doing this, but obviously there are complications, and getting it just right might be challenging -- I have no insight into exactly how to adjust things.

Alternatively, I do have a distributed/out-of-core approach to computing a knn graph, which I think would solve the problem and is likely a more direct path. I haven't had the time or opportunity to revisit that in depth and get it polished for release, but there is a "distributed" branch of pynndescent: https://github.com/lmcinnes/pynndescent/tree/distributed. It is quite a ways behind main, but it does have the requisite code to run in a dask distributed setup, and should be similarly amenable to an out-of-core approach. That may be an alternative path you could look into.
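If you do end up computing the knn graph externally (distributed or out-of-core), then -- if I'm remembering the current umap-learn API correctly -- you can hand it to UMAP via the `precomputed_knn` argument rather than rebuilding it, along the lines of:

```python
# Hedged sketch: reuse an externally computed knn graph via precomputed_knn
# (see the "precomputed k-nn" page of the umap-learn docs for the details).
import numpy as np
import umap
from pynndescent import NNDescent

X = np.random.rand(10_000, 50).astype(np.float32)

# Stand-in for a graph produced by a distributed / out-of-core run.
index = NNDescent(X, n_neighbors=15)
knn_indices, knn_dists = index.neighbor_graph

mapper = umap.UMAP(
    n_neighbors=15,
    precomputed_knn=(knn_indices, knn_dists, None),  # (indices, dists, search index)
)
embedding = mapper.fit_transform(X)
```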
Thanks very much!