
[RFC] Feature: ParametricUMAP PyTorch implementation

Open leriomaggio opened this issue 3 years ago • 24 comments

In the spirit of PR #578, which allows dependencies not to interfere too much with the test suite, I was wondering whether you would consider adding an extra dependency 😅

More seriously, I started working on an alternative PyTorch-based implementation of ParametricUMAP (I will soon sync the branch on my fork).

If you would generally welcome the idea and the contribution, I will elaborate my UEP (UMAP Enhancement Proposal) here, so that the two implementations can co-exist with the least possible impact on the API for backward compatibility.

leriomaggio avatar Feb 10 '21 20:02 leriomaggio

I would like to get @timsainb 's opinion on this. Personally I can see a lot of value in supporting other deep learning frameworks, and since dependencies can be optional I don't see an extra potential dependency as a problem. I do know that others have asked about a PyTorch version, so there is certainly interest. I think it would be a very exciting addition. The key would be to ensure it syncs up with @timsainb 's implementation in TF and doesn't cause (too many) weird interactions between what happens under the different backends. It sounds like you have thought about that, so please feel free to elaborate on that -- I would like to hear more.

lmcinnes avatar Feb 11 '21 18:02 lmcinnes

I think that would be a great addition.

timsainb avatar Feb 11 '21 18:02 timsainb

If Tim is on board, then I think we can call this a go! Really looking forward to seeing this in place.

lmcinnes avatar Feb 11 '21 19:02 lmcinnes

Fantastic!! 🙌
Thanks so much @timsainb and @lmcinnes for getting back to me about this! Will submit my initial PR very soon so that we can start discussing design options there.

I will be sharing quick ideas here as well while I do work on it, so that we could potentially upvote/downvote options very quickly - will be starting soon in my next post!

Thank you so much :)

leriomaggio avatar Feb 12 '21 11:02 leriomaggio

Q: How do we allow multiple DL backends for Parametric UMAP (so far, tf and torch)?

In other words, which implementation will be imported whenever I do:

from umap.parametric_umap import ParametricUMAP

My idea would be in favour of a Convention over Configuration based approach - so no configuration file à la Keras to choose the backend.

General Idea

  • If torch is available but not tf --> go with torch
  • If tf is available but not torch --> go with tf
  • If tf and torch are both installed in the environment --> go by default with @timsainb 's tf implementation (unless otherwise specified, e.g. by explicitly importing the torch-based class/implementation)

Similarly, whenever custom encoder and/or decoder parameter(s) are provided to the constructor, the choice of the backend will be inferred directly from the type of those objects.

Again, the goal is to be as unobtrusive as possible with the current APIs - see the sketch below.
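To make the convention concrete, here is a minimal sketch of how the import-time dispatch might look (the module and function names below are purely illustrative, not actual umap-learn API):

    # Inside a hypothetical umap/parametric_umap/__init__.py
    def _select_backend():
        try:
            import tensorflow  # noqa: F401
            return "tf"        # tf wins whenever it is installed (the default)
        except ImportError:
            pass
        try:
            import torch  # noqa: F401
            return "torch"
        except ImportError:
            raise ImportError(
                "ParametricUMAP needs either tensorflow or torch installed"
            )

    _BACKEND = _select_backend()
    if _BACKEND == "tf":
        from ._tf_impl import ParametricUMAP     # hypothetical module name
    else:
        from ._torch_impl import ParametricUMAP  # hypothetical module name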

What do you think? :)

leriomaggio avatar Feb 12 '21 11:02 leriomaggio

Sounds ideal to me. I really like the idea of having the least friction for users, and this would meet that goal. It may be worth providing a warning, or some kind of feedback to the user, in the fallback/default case, so that they know that a decision has been made without their input. That is, if both tf and torch are installed, and no custom encoder or decoder is provided, let the user know that tf will be used, and how they can force the torch option if they would prefer.

In practice I assume the main reasons to have a strong preference are that the user has one framework but not the other installed, or that they want to define their own network architecture (encoder/decoder) in their preferred framework. Are there other reasons to have a significant preference?

lmcinnes avatar Feb 12 '21 19:02 lmcinnes

@lmcinnes I'm currently running into problems with tensorflow's protobuf size limits when I try to run on larger datasets. Maybe pytorch won't have this issue? I haven't been able to find an easy workaround so far.

cyrusmaher avatar Feb 22 '21 18:02 cyrusmaher

@cyrusmaher I have the impression that this may depend on many different factors: e.g. the available hardware architecture, or the dataset size, just to mention the two most obvious.

One thing is for sure: torch does not use protobuf, so if you think your issue is related to that, it might not show up there. That being said, it is very hard to think of a possible workaround or solution without a clearer idea of the actual problem and errors. So my personal piece of advice would be to open a specific issue about it - if at all possible, specifying the environment used to replicate the issue. That way, it would be easier to understand what the problem is and where it lies :)

HTH

leriomaggio avatar Feb 24 '21 20:02 leriomaggio

Hey @leriomaggio. Is this still something you're thinking about? I've been playing around with pytorch and I'm sold on its advantages over tensorflow. It might even be a good idea to switch the primary backend over to pytorch. From a user standpoint pytorch is just a bit easier to deal with, and it seems like pytorch lightning gives us the same level of abstraction as keras.

timsainb avatar May 19 '21 21:05 timsainb

Hey @timsainb, thanks a lot for reaching out about this, and yes: I am very keen on continuing to work on this, as I believe it would be a great addition! Unfortunately, I had to stop because I got swallowed by a lot of teaching recently, and then the end of term. But I had already planned to get back to this in the next few days.

I am happy that you share the same view on PyTorch, and I'd be happy to discuss any ideas you might have about the design (here, or wherever you think fits best). I was indeed keen on supporting pure PyTorch first, as I don't see the Lightning integration as necessary in the way Keras is for Tensorflow (also in terms of adoption). However, happy to discuss this too.

As for where I left off: I think I started with what is perhaps the most "intricate" part, namely porting the covariance calculation for the distance_loss_correlation as done in tensorflow_probability. There is no equivalent of tensorflow_probability for PyTorch, so I decided to resort to re-implementing it from scratch: not a big deal, really, only some headaches getting the tensor dimensions in place and making sure the operations (and the assumptions on the data) match.

The implementation should be in place, and some tests are there as well, although I was planning to complete the tests and then merge the branch in my fork into a similar branch here to continue working on the implementation.

Here is the link to the implementation, for reference: parametric_umap/stats_utils.py
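For illustration only, a correlation term over pairwise distances might look something like the following in pure PyTorch (a minimal sketch, not the linked implementation; the function names are mine):

    import torch

    def pearson_correlation(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Sample Pearson correlation between two 1-D tensors, standing in
        # for what tensorflow_probability provides on the tf side.
        xc = x - x.mean()
        yc = y - y.mean()
        cov = (xc * yc).mean()
        return cov / (x.std(unbiased=False) * y.std(unbiased=False) + 1e-12)

    def distance_correlation_loss(batch_hd: torch.Tensor,
                                  batch_ld: torch.Tensor) -> torch.Tensor:
        # Global loss term: make pairwise distances in the embedding
        # correlate with pairwise distances in the original space
        # (negated, since we minimize the loss).
        d_hd = torch.pdist(batch_hd)  # condensed distances, shape (n*(n-1)/2,)
        d_ld = torch.pdist(batch_ld)
        return -pearson_correlation(d_hd, d_ld)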

Please let me know what you think :) 👍

EDIT: I've just made sure that the latest (⚠️ WIP) changes are there. Happy to receive your feedback so far!

leriomaggio avatar May 20 '21 16:05 leriomaggio

Sounds great. I'm trying to tie up my thesis at the moment so I relate to juggling a whole lot of things at once.

> I am happy that you share the same view on PyTorch, and I'd be happy to discuss any ideas you might have about the design (here, or wherever you think fits best). I was indeed keen on supporting pure PyTorch first, as I don't see the Lightning integration as necessary in the way Keras is for Tensorflow (also in terms of adoption). However, happy to discuss this too.

Sounds good to me, I'm new to pytorch so I'm sure you'd know better than me. If we can get this working without making people install anything new, all the better.

> As for where I left off: I think I started with what is perhaps the most "intricate" part, namely porting the covariance calculation for the distance_loss_correlation as done in tensorflow_probability. There is no equivalent of tensorflow_probability for PyTorch, so I decided to resort to re-implementing it from scratch: not a big deal, really, only some headaches getting the tensor dimensions in place and making sure the operations (and the assumptions on the data) match. The implementation should be in place, and some tests are there as well, although I was planning to complete the tests and then merge the branch in my fork into a similar branch here to continue working on the implementation.

That makes sense. I suppose I got lucky in finding tensorflow had already solved this. Your implementation looks great so far!

timsainb avatar May 22 '21 19:05 timsainb

Hi @leriomaggio and @timsainb, I just came across this because I was looking for exactly this feature. I need some more flexibility with the embedding topology (e.g. I need to embed on a torus but also learn the torus dimensions) and I'm used to working with PyTorch. So this would be a killer feature for me :)

Is this still in progress? Is there maybe even a beta version that I could try out? :)

robert-lieck avatar Aug 12 '21 21:08 robert-lieck

Hi @robert-lieck. I'm still trying to get my thesis finished and don't personally have the time to work on this. Very interested in seeing a torch implementation happen though!

For a basic implementation, all you need to do is translate this self-contained notebook to torch: https://colab.research.google.com/drive/1WkXVZ5pnMrm17m0YgmtoNjM_XHdnE5Vp?usp=sharing
Probably every tensorflow function used there has a direct translation into torch, so I don't expect it would take too long. Writing your own implementation will give you the best flexibility too.

For getting a version working in this repo: @leriomaggio 's implementation of the functions that tensorflow relies on tensorflow_probability for (from the comment above) looks fantastic. The tensorflow_probability part is only needed for the global loss. To get a version that matches the tensorflow one, the next challenge is writing a robust data iterator for training, as well as an optimization loop to replace keras.fit (see the sketch below). The rest of the tensorflow functions have a pretty direct translation into torch.
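A bare-bones replacement for keras.fit could be a plain PyTorch loop like this (a sketch under assumptions: encoder is the torch network, loader yields batches of (head, tail) samples drawn from the UMAP graph's positive edges, and umap_loss is the edge-wise loss still to be written):

    import torch

    def fit(encoder, loader, umap_loss, n_epochs=10, lr=1e-3, device="cpu"):
        # Plain PyTorch stand-in for keras.fit
        encoder.to(device)
        opt = torch.optim.Adam(encoder.parameters(), lr=lr)
        for epoch in range(n_epochs):
            total = 0.0
            for x_to, x_from in loader:
                opt.zero_grad()
                z_to = encoder(x_to.to(device))
                z_from = encoder(x_from.to(device))
                loss = umap_loss(z_to, z_from)  # assumed: CE over edges + negative samples
                loss.backward()
                opt.step()
                total += loss.item()
            print(f"epoch {epoch}: mean loss {total / len(loader):.4f}")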

Beyond that, it

timsainb avatar Aug 13 '21 01:08 timsainb

Hi both @robert-lieck @timsainb. First off, thanks very much @robert-lieck for reviving the topic: not a day passes that I don't wish I could find some time to get back to this! I very much want this to happen! I've recently been contributing to two other open-source projects and this unfortunately got a bit delayed, but I really do hope to pick up where I left off very soon.

Just bear with me another little bit: I promise I do really want this integration to happen! 💪

That said: everything @timsainb already mentioned is correct and is certainly the starting point!

leriomaggio avatar Aug 13 '21 16:08 leriomaggio

Thanks for your replies @timsainb and @leriomaggio

@timsainb thanks for linking the notebook and good luck with the thesis writing 💪

@leriomaggio This is great! I will be off in a couple of days until the end of September, but I might be able to do some work on this once I'm back.

One question for clarification: does ParametricUMAP also allow for parameterising the output_metric? It looks like this is not the case at the moment. Having that part implemented (and thus trainable) via PyTorch would be really cool: (1) it would allow learning the embedding topology, and (2) the gradient computations would come for free (they currently need to be implemented manually via numba).

I haven't dived deep enough into the code yet, but I guess there should be quite a few synergies in tackling this in the same go, right?

robert-lieck avatar Aug 16 '21 12:08 robert-lieck

Hi everyone! @leriomaggio @timsainb @robert-lieck @lmcinnes. This thread has been dead for a while, but out of necessity I've recently ported a simplified parametric umap to pytorch, using pytorch lightning for the training loop. Here is the repo: https://github.com/elyxlz/umap_pytorch. I did this just by following the notebook and translating it to pytorch. However, I encountered a problem on my first test with MNIST. I do successfully get some class separation, but not nearly as good as shown in the notebook, despite using exactly the same hyperparameters. Could anyone with more knowledge than me help me figure out why? Interestingly, my loss converged much earlier than it did in the notebook (around 0.2 instead of 0.1). For reference, here is the plot:

[image: MNIST embedding showing only partial class separation]

elyxlz avatar Dec 31 '22 00:12 elyxlz

Awesome to see someone working on this.

I've just looked briefly, but it looks like the convert_distance_to_probability differs from the original?

timsainb avatar Jan 02 '23 03:01 timsainb

I fixed the probability computation and it looks much better now: https://colab.research.google.com/drive/1CYxt0GD-Y2zPMOnJIXJWsAhr0LdqI0R6?usp=sharing

[image: corrected MNIST embedding with clear class separation]

I'm not sure why the loss converges at 0.2 instead of 0.1, it might require some additional digging around to find more implementation differences.
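For anyone following along: the low-dimensional membership strength in UMAP is q_ij = 1 / (1 + a * d_ij^(2b)), so a torch version of convert_distance_to_probability might read roughly as follows (a sketch; the a and b defaults are approximately the values fit for the default min_dist=0.1, spread=1.0):

    import torch

    def convert_distance_to_probability(distances: torch.Tensor,
                                        a: float = 1.577,
                                        b: float = 0.895) -> torch.Tensor:
        # UMAP's differentiable low-dimensional similarity:
        #   q_ij = 1 / (1 + a * d_ij^(2b))
        return 1.0 / (1.0 + a * distances.pow(2 * b))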

timsainb avatar Jan 02 '23 04:01 timsainb

Thank you so much! That's amazing news that it works. Strangely enough, the original notebook you shared has some differences from the implementation on GitHub, such as the distance-to-probability function and how the cross entropy is computed. Maybe I should compare more against the github version to find out why the loss isn't converging correctly. Thanks again! :)

elyxlz avatar Jan 02 '23 10:01 elyxlz

+1 this thread / idea. Would love to get it integrated with the core UMAP package. I hope this really happens.

ayaanhossain avatar Mar 05 '23 02:03 ayaanhossain

Really cool incremental development. I am also waiting for a pytorch implementation desperately. Rooting for you folks!

deborshigoswami95 avatar Mar 10 '23 02:03 deborshigoswami95

Hi all, thanks for sharing the collab notebook, I think a pytorch implementation would be a wonderful addition.

One comment/question about the code: in the UMAPDataset class, the length is defined as int(self.data.shape[0]), which is the 60,000 MNIST train digits - but shouldn't the dataset length be the size of self.edges_to_exp? Otherwise, during training we only ever iterate over the first 60,000 sampled pairs from the shuffle mask. You probably don't want to iterate over the whole shuffle mask, but you would like to sample a different subset each epoch.

I don't think this solves the above mentioned loss discrepancy though.

jh83775 avatar Jun 09 '23 14:06 jh83775

Thanks so much @elyxlz @timsainb for this PyTorch implementation of Parametric UMAP. I'm in the process of adding it to our library (156 commits into this branch :see_no_evil:), and having an implementation to adapt from has been really helpful.

Just want to add that I think @jh83775 is right -- you will want __len__ of the dataset to return self.edges_to_exp.shape[0], instead of self.data.shape[0].
You can verify this by setting __len__ to return some arbitrary small number (e.g. return 10) and seeing that training stops quickly.

I wonder if this "truncation" of the dataset explains the loss not converging all the way to 0.1 like you saw @elyxlz? Basically because we're not iterating through the entire dataset, so we can't drive the loss all the way down.
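To make the fix concrete, here is a minimal sketch of the corrected dataset (attribute names follow the repo; this is illustrative, not a patch):

    import torch
    from torch.utils.data import Dataset

    class UMAPDataset(Dataset):
        def __init__(self, data, edges_to_exp, edges_from_exp):
            self.data = data                      # (n_samples, n_features)
            self.edges_to_exp = edges_to_exp      # (n_edges,) head indices
            self.edges_from_exp = edges_from_exp  # (n_edges,) tail indices

        def __len__(self):
            # one item per sampled edge, so an epoch sees every edge once
            return self.edges_to_exp.shape[0]

        def __getitem__(self, idx):
            return (self.data[self.edges_to_exp[idx]],
                    self.data[self.edges_from_exp[idx]])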

Also, this is a total nitpick, but I'm not sure if it's necessary to shuffle here? https://github.com/elyxlz/umap_pytorch/blob/04d3cc6923ee783fa69806728ee165b92cd6af73/umap_pytorch/data.py#L40

        shuffle_mask = np.random.permutation(np.arange(len(self.edges_to_exp)))
        self.edges_to_exp = self.edges_to_exp[shuffle_mask].astype(np.int64)
        self.edges_from_exp = self.edges_from_exp[shuffle_mask].astype(np.int64)

This might be a case of translating literally from keras idioms -- I'm pretty sure PyTorch / Lightning will handle the shuffling as long as shuffle is set to True for the DataLoader, as it is here: https://github.com/elyxlz/umap_pytorch/blob/04d3cc6923ee783fa69806728ee165b92cd6af73/umap_pytorch/main.py#L91 (see the sketch below). I haven't tested this by removing the shuffling though. Please correct me if there's something I'm missing.
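In other words, something like this should be enough, with the manual permutation dropped (a sketch; the batch size is arbitrary):

    from torch.utils.data import DataLoader

    # shuffle=True reshuffles the edge indices every epoch, so the manual
    # np.random.permutation over edges_to_exp/edges_from_exp is redundant
    loader = DataLoader(dataset, batch_size=1024, shuffle=True, drop_last=True)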

NickleDave avatar Aug 14 '23 01:08 NickleDave

What is the current status on ParametricUMAP PyTorch? Is https://github.com/elyxlz/umap_pytorch the only public code?

Hackathorn avatar Feb 27 '24 18:02 Hackathorn