Torch implementation of ParametricUMAP
Hi All, I've popped in a pull request with a version of ParametricUMAP written in PyTorch. It mostly follows the discussions in the threads below:
- https://github.com/lmcinnes/umap/issues/580
- https://colab.research.google.com/drive/1CYxt0GD-Y2zPMOnJIXJWsAhr0LdqI0R6
I'm not quite sure where it best fits in the codebase - currently I've just put it in umap/torch.py so you can import with `from umap.torch import ParametricUMAP`, which feels intuitive. It should be fairly easy to extend this to a nonparametric UMAP as well, via the torch.nn.Embedding class, but I can put in a fresh pull request for that when I find the time.

There could also be a discussion around which parameters should be accessible to the user without bloating the number of optional args. For example, in my implementation the user can manually change the batch size and learning rate, but not the optimizer or negative sample rate. What is included in the optional args is somewhat arbitrary, based on what I tended to change most for my use case. It would be relatively easy to expose any currently hardcoded parameter choices to the user, though.
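To make that parameter discussion concrete, one pattern would be to collect the tunable choices in a single config object, so that exposing a currently hardcoded value later is a one-line change rather than a signature change. This is just a sketch of the idea - the class name, field names, and defaults below are hypothetical and are not the PR's actual API:

```python
from dataclasses import dataclass


@dataclass
class ParametricUMAPConfig:
    """Hypothetical hyperparameter bundle; names and defaults are illustrative only."""

    batch_size: int = 1024         # already user-settable in the PR
    learning_rate: float = 1e-3    # already user-settable in the PR
    negative_sample_rate: int = 5  # currently hardcoded; could be exposed like this
    optimizer: str = "adam"        # currently hardcoded; could be exposed like this


# Users override only what they care about; everything else keeps its default.
cfg = ParametricUMAPConfig(batch_size=256)
print(cfg.batch_size, cfg.negative_sample_rate)
```

With this pattern the constructor could take a single optional `config` argument, keeping the public signature small while still letting power users reach every knob.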
I've also added a few examples in a Jupyter notebook (see notebooks/Parametric_UMAP/08.0-torch-parametric-umap.ipynb) to give an indication of how to get started.
If any PyTorch experts want to take a look, I'll gladly add in any optimizations (I'm sure there will be a few things here and there that can be changed and improve the runtime).
I've tested it with pytorch==1.12.1 & CUDA==11.6. I think it should work with other versions too, but haven't tested more widely.
Hello @jh83775! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file umap/torch.py:
  - Line 297:1: W293 blank line contains whitespace
Comment last updated at 2024-03-26 11:59:02 UTC
Thanks for this. There is currently work underway to support Keras 3 (the initial support is done) such that multiple backends, including pytorch, could be used. You may want to check in on the refactoring discussion on PR #1101
That looks helpful, thanks. I'll update to reflect the discussions on that thread. Looks like the keras-pytorch version might supersede this work, although hopefully this can still be useful for those more familiar with torch than keras
I think there's still merit in this; I just don't have a good sense of where things will settle to know what the right way to support the various options will be.
So as far as I can tell this will just sit alongside the existing Keras 3 implementation (which should now support a pytorch backend) just fine as an extra pytorch-specific ParametricUMAP. I would be happy to just merge this if you think it has sufficient value as-is as a standalone pytorch-specific implementation.
I'm also glad to see this implementation. Thanks!
I do think the overlap in functionality between this and the keras implementation is pretty high.
This code is missing some key functionality (though it's been a few days since I read it). One of the benefits of parametric UMAP is that it lets you balance a global structure (MDS-style) loss and an autoencoder reconstruction loss against the standard UMAP loss, which, if I recall correctly, are not in this implementation.
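For readers unfamiliar with those extra terms: the idea is that the per-batch objective becomes a weighted sum of the UMAP cross-entropy loss, a global-distance term (e.g. the negative correlation between pairwise distances in data space and embedding space), and a reconstruction term from a decoder head. A toy NumPy sketch of how such a combined loss could be assembled - all weights, shapes, and the placeholder UMAP term here are illustrative, not the Keras implementation's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one batch: high-dim inputs x, 2-D embeddings z,
# and a decoder reconstruction x_hat (all hypothetical).
x = rng.normal(size=(32, 50))
z = rng.normal(size=(32, 2))
x_hat = x + 0.1 * rng.normal(size=x.shape)


def pairwise_dists(a):
    """Dense matrix of Euclidean distances between rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))


# Global structure term: negative correlation between data-space and
# embedding-space pairwise distances (upper triangle, no diagonal).
iu = np.triu_indices(32, k=1)
d_x = pairwise_dists(x)[iu]
d_z = pairwise_dists(z)[iu]
global_loss = -np.corrcoef(d_x, d_z)[0, 1]

# Autoencoder term: plain reconstruction MSE.
recon_loss = ((x - x_hat) ** 2).mean()

# Placeholder for the batch's UMAP cross-entropy term.
umap_loss = 0.7

# Weighted total; the weights (0.1, 1.0) are arbitrary illustration values.
total = umap_loss + 0.1 * global_loss + 1.0 * recon_loss
print(total)
```

In a real implementation these would be differentiable tensor ops and the weights would be user-tunable, which is exactly the balancing act described above.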
One thing that the keras code is missing, and this code has, is the data iterator. Currently, to run the keras code you still rely on the tensorflow dataset even if you use torch as a backend, meaning you need keras, tensorflow, and torch all installed. If we pull the iterator from this code into the keras implementation, we will overcome that issue (this is something I've been meaning to do but haven't had time yet).
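The core of such an iterator is small: sample positive edges from the weighted graph and draw random vertices as negatives, one batch at a time. A minimal NumPy sketch of the idea (function name, signature, and the weight-proportional sampling scheme are illustrative assumptions, not the PR's actual code):

```python
import numpy as np


def edge_batches(heads, tails, weights, batch_size, n_vertices,
                 negative_sample_rate=5, seed=0):
    """Yield (positive heads, positive tails, negative tails) batches
    from a weighted edge list, without materializing an expanded graph."""
    rng = np.random.default_rng(seed)
    # Sample edges proportionally to their UMAP graph weight
    # (a stand-in for the epochs-per-sample scheme).
    p = weights / weights.sum()
    order = rng.choice(len(heads), size=len(heads), replace=True, p=p)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        # Uniform random vertices serve as negative samples.
        neg = rng.integers(0, n_vertices,
                           size=(len(idx), negative_sample_rate))
        yield heads[idx], tails[idx], neg
```

A usage sketch: with 10 edges and `batch_size=4` this yields three batches (sizes 4, 4, 2), each paired with a `(batch, negative_sample_rate)` block of negative indices, which the training loop can feed straight to the encoder.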
@timsainb as per the discussion on HN by Max Woolf, one big limitation of the current parametric UMAP implementation is running out of memory, caused by the tensorflow dataset pulling the entire graph into memory.
So a parametric UMAP implementation that works iterating over the dataset would be great.
@turian I think this is out of scope for the umap python library, but I give an example of how to build your own graph and iterator here: https://colab.research.google.com/drive/1WkXVZ5pnMrm17m0YgmtoNjM_XHdnE5Vp?usp=sharing
You could switch out the graph and the iterator for ones that support a graph larger than memory, with the iterator grabbing batches from that graph.
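To sketch what such a swap might look like: instead of holding the edge list in one array, the iterator can consume any source that yields edge chunks (e.g. read chunk-by-chunk from memory-mapped files) and buffer only enough edges to fill the next batch. Everything below - the function name, signature, and buffering scheme - is a hypothetical illustration, not code from the linked notebook:

```python
import numpy as np


def stream_edge_batches(edge_chunks, batch_size, n_vertices,
                        negative_sample_rate=5, seed=0):
    """Yield training batches from an out-of-core edge list.

    `edge_chunks` is any iterable yielding (heads, tails) arrays, e.g.
    loaded chunk-by-chunk from disk, so the full graph never has to be
    resident in memory at once.
    """
    rng = np.random.default_rng(seed)
    h = np.empty(0, dtype=np.int64)
    t = np.empty(0, dtype=np.int64)
    for heads, tails in edge_chunks:
        # Append the new chunk to the small carry-over buffer.
        h = np.concatenate([h, heads])
        t = np.concatenate([t, tails])
        # Emit full batches; keep only the remainder buffered.
        while len(h) >= batch_size:
            neg = rng.integers(0, n_vertices,
                               size=(batch_size, negative_sample_rate))
            yield h[:batch_size], t[:batch_size], neg
            h, t = h[batch_size:], t[batch_size:]
    if len(h):  # flush the final partial batch
        neg = rng.integers(0, n_vertices,
                           size=(len(h), negative_sample_rate))
        yield h, t, neg
```

Peak memory here is one chunk plus at most one batch of carry-over, regardless of total graph size; weight-proportional edge sampling could be layered on top per chunk.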