scanpy icon indicating copy to clipboard operation
scanpy copied to clipboard

UMAP implementation in igraph

Open iosonofabio opened this issue 1 year ago • 4 comments

What kind of feature would you like to request?

Additional function parameters / changed functionality / changed defaults?

Please describe your wishes

dear devs,

Thank you for maintaining scanpy, which is a great piece of software.

I am a core developer of igraph and around a year ago we started providing an implementation of UMAP. It is accessible via the igraph Python package and the actual computations run entirely in C. This is designed to be an alternative to the original UMAP implementation that requires numba, llvmlite, etc. Since scanpy relies on igraph for clustering already, this might be useful for folks who have trouble with the dependency stack.

Just like Leiden clustering can now use either leidenalg or directly igraph, it would be nice to let people choose flavor='igraph' for embedding as well.

Our implementation requires as input a nearest neighbor graph with distances and has two parts, just like the original umap:

  1. a function to compute what are now called in scanpy "connectivities" from the distances. This is among other things a symmetrysation of the knn graph.
  2. a function to compute the embedding from the connectivity graph.

Would you be interested in this? If so, I can make a PR for scanpy and perhaps one of you can review it?

iosonofabio avatar Dec 18 '24 22:12 iosonofabio

Hello, I'd be glad to provide this enhancement and joining the project. Can you assign this issue to me? Thanks

Riccardo231 avatar Dec 26 '24 16:12 Riccardo231

Sure, I assigned you for now! Please note

  1. We have a contribution guide

  2. A similar PR exists that you can use as orientation: #2815

    Specifically, please add flavor='igraph' as a non-default option.

Regarding design, there are two considerations you can engage with if you want:

  • It’s worth it to investigate how to best include the two-step process around connectivities @iosonofabio mentioned. One way would be to have a kind of sklearn-like “transformer” that creates the connectivities on .fit() and the UMAP on .transform() if that makes sense.
  • Another consideration is #2157: Computing the umap from an existing distance matrix instead of raw data. If you end up going for the two-step solution, incorporating this would also make sense.

If you want to have a call with us about how to best design an API around all that (or about which parts you want to leave out), please tell us. If you already have a straightforward idea about what to do, please go ahead!

flying-sheep avatar Jan 09 '25 15:01 flying-sheep

Thanks both. Note that scanpy already has the two step process in place internally, and you could swap igraph for either or both steps.

Actually it's three steps, the first one being the creation of the knn graph. Scanpy already resected that chunk out of the official UMAP implementation for performance reasons.

Let me know if I can help

iosonofabio avatar Jan 09 '25 20:01 iosonofabio

Hi @ricor07, if you run into any questions regarding contributing, don’t hesitate to ask!

flying-sheep avatar Jul 30 '25 09:07 flying-sheep