Torch implementation of ParametricUMAP
Hi All, I've popped in a pull request with a version of ParametricUMAP written in PyTorch. It mostly follows the discussions in the threads below:
- https://github.com/lmcinnes/umap/issues/580
- https://colab.research.google.com/drive/1CYxt0GD-Y2zPMOnJIXJWsAhr0LdqI0R6
I'm not quite sure where it best fits in the codebase - currently I've just put it in umap/torch.py so you can import with `from umap.torch import ParametricUMAP`, which feels intuitive. It should be fairly easy to extend this to a nonparametric UMAP as well, via the torch.nn.Embedding class, but I can put in a fresh pull request for that when I find the time.

There could also be a discussion around which parameters should be accessible to the user without bloating the number of optional args. For example, in my implementation the user can manually change the batch size and learning rate, but not the optimizer or negative sample rate. What is included in the optional args is somewhat arbitrary, based on what I tended to change most for my use case. It would be relatively easy to expose any currently hardcoded parameter choices to the user, though.
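To make that parameter discussion concrete, one pattern would be to collect the tunable choices in a single config object, so that exposing a currently hardcoded value later is a one-line change rather than a signature change. This is just a sketch of the idea - the class name, field names, and defaults below are hypothetical and are not the PR's actual API:

```python
from dataclasses import dataclass


@dataclass
class ParametricUMAPConfig:
    """Hypothetical hyperparameter bundle; names and defaults are illustrative only."""

    batch_size: int = 1024         # already user-settable in the PR
    learning_rate: float = 1e-3    # already user-settable in the PR
    negative_sample_rate: int = 5  # currently hardcoded; could be exposed like this
    optimizer: str = "adam"        # currently hardcoded; could be exposed like this


# Users override only what they care about; everything else keeps its default.
cfg = ParametricUMAPConfig(batch_size=256)
print(cfg.batch_size, cfg.negative_sample_rate)
```

With this pattern the constructor could take a single optional `config` argument, keeping the public signature small while still letting power users reach every knob.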
I've also added a few examples in a Jupyter notebook (see notebooks/Parametric_UMAP/08.0-torch-parametric-umap.ipynb) to give an indication of how to get started.
If any PyTorch experts want to take a look, I'll gladly add in any optimizations (I'm sure there will be a few things here and there that can be changed and improve the runtime).
I've tested it with pytorch==1.12.1 & CUDA==11.6. I think it should work with other versions too, but haven't tested more widely.
Hello @jh83775! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
- In the file umap/torch.py:
  - Line 297:1: W293 blank line contains whitespace
Comment last updated at 2024-03-26 11:59:02 UTC
Thanks for this. There is currently work underway to support Keras 3 (the initial support is done) such that multiple backends, including pytorch, could be used. You may want to check in on the refactoring discussion on PR #1101
That looks helpful, thanks. I'll update to reflect the discussions on that thread. Looks like the keras-pytorch version might supersede this work, although hopefully this can still be useful for those more familiar with torch than keras
I think there's still merit in this; I just don't have a good sense of where things will settle to know what the right way to support the various options will be.
So as far as I can tell this will just sit alongside the existing Keras 3 implementation (which should now support a pytorch backend) just fine as an extra pytorch-specific ParametricUMAP. I would be happy to just merge this if you think it has sufficient value as-is as a standalone pytorch-specific implementation.
I'm also glad to see this implementation. Thanks!
I do think the overlap in functionality between this and the keras implementation is pretty high.
This code is missing some key functionality (though it's been a few days since I read it). One of the benefits of parametric UMAP is that it lets you balance a global structure (MDS-style) loss and an autoencoder reconstruction loss against the standard UMAP loss, which, if I recall correctly, are not in this implementation.
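For readers unfamiliar with those extra terms: the idea is that the per-batch objective becomes a weighted sum of the UMAP cross-entropy loss, a global-distance term (e.g. the negative correlation between pairwise distances in data space and embedding space), and a reconstruction term from a decoder head. A toy NumPy sketch of how such a combined loss could be assembled - all weights, shapes, and the placeholder UMAP term here are illustrative, not the Keras implementation's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one batch: high-dim inputs x, 2-D embeddings z,
# and a decoder reconstruction x_hat (all hypothetical).
x = rng.normal(size=(32, 50))
z = rng.normal(size=(32, 2))
x_hat = x + 0.1 * rng.normal(size=x.shape)


def pairwise_dists(a):
    """Dense matrix of Euclidean distances between rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))


# Global structure term: negative correlation between data-space and
# embedding-space pairwise distances (upper triangle, no diagonal).
iu = np.triu_indices(32, k=1)
d_x = pairwise_dists(x)[iu]
d_z = pairwise_dists(z)[iu]
global_loss = -np.corrcoef(d_x, d_z)[0, 1]

# Autoencoder term: plain reconstruction MSE.
recon_loss = ((x - x_hat) ** 2).mean()

# Placeholder for the batch's UMAP cross-entropy term.
umap_loss = 0.7

# Weighted total; the weights (0.1, 1.0) are arbitrary illustration values.
total = umap_loss + 0.1 * global_loss + 1.0 * recon_loss
print(total)
```

In a real implementation these would be differentiable tensor ops and the weights would be user-tunable, which is exactly the balancing act described above.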
One thing that the keras code is missing, and this code has, is the data iterator. Currently, to run the keras code you still rely on the tensorflow dataset even if you use torch as a backend, meaning you need keras, tensorflow, and torch all installed. If we pull the iterator from this code into the keras implementation, we will overcome that issue (this is something I've been meaning to do but haven't had time yet).
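The core of such an iterator is small: sample positive edges from the weighted graph and draw random vertices as negatives, one batch at a time. A minimal NumPy sketch of the idea (function name, signature, and the weight-proportional sampling scheme are illustrative assumptions, not the PR's actual code):

```python
import numpy as np


def edge_batches(heads, tails, weights, batch_size, n_vertices,
                 negative_sample_rate=5, seed=0):
    """Yield (positive heads, positive tails, negative tails) batches
    from a weighted edge list, without materializing an expanded graph."""
    rng = np.random.default_rng(seed)
    # Sample edges proportionally to their UMAP graph weight
    # (a stand-in for the epochs-per-sample scheme).
    p = weights / weights.sum()
    order = rng.choice(len(heads), size=len(heads), replace=True, p=p)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        # Uniform random vertices serve as negative samples.
        neg = rng.integers(0, n_vertices,
                           size=(len(idx), negative_sample_rate))
        yield heads[idx], tails[idx], neg
```

A usage sketch: with 10 edges and `batch_size=4` this yields three batches (sizes 4, 4, 2), each paired with a `(batch, negative_sample_rate)` block of negative indices, which the training loop can feed straight to the encoder.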
@timsainb as per the discussion on HN by Max Woolf, one big limitation of the current parametric UMAP implementation is running out of memory, caused by the tensorflow dataset pulling the entire graph into memory.
So a parametric UMAP implementation that works iterating over the dataset would be great.
@turian I think this is out of scope for the umap python library, but I give an example of how to build your own graph and iterator here: https://colab.research.google.com/drive/1WkXVZ5pnMrm17m0YgmtoNjM_XHdnE5Vp?usp=sharing
You could switch out the graph and the iterator for ones that support a graph larger than memory, with the iterator grabbing batches from that graph.
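To sketch what such a swap might look like: instead of holding the edge list in one array, the iterator can consume any source that yields edge chunks (e.g. read chunk-by-chunk from memory-mapped files) and buffer only enough edges to fill the next batch. Everything below - the function name, signature, and buffering scheme - is a hypothetical illustration, not code from the linked notebook:

```python
import numpy as np


def stream_edge_batches(edge_chunks, batch_size, n_vertices,
                        negative_sample_rate=5, seed=0):
    """Yield training batches from an out-of-core edge list.

    `edge_chunks` is any iterable yielding (heads, tails) arrays, e.g.
    loaded chunk-by-chunk from disk, so the full graph never has to be
    resident in memory at once.
    """
    rng = np.random.default_rng(seed)
    h = np.empty(0, dtype=np.int64)
    t = np.empty(0, dtype=np.int64)
    for heads, tails in edge_chunks:
        # Append the new chunk to the small carry-over buffer.
        h = np.concatenate([h, heads])
        t = np.concatenate([t, tails])
        # Emit full batches; keep only the remainder buffered.
        while len(h) >= batch_size:
            neg = rng.integers(0, n_vertices,
                               size=(batch_size, negative_sample_rate))
            yield h[:batch_size], t[:batch_size], neg
            h, t = h[batch_size:], t[batch_size:]
    if len(h):  # flush the final partial batch
        neg = rng.integers(0, n_vertices,
                           size=(len(h), negative_sample_rate))
        yield h, t, neg
```

Peak memory here is one chunk plus at most one batch of carry-over, regardless of total graph size; weight-proportional edge sampling could be layered on top per chunk.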