scanpy

Add cluster_resolution_finder and cluster_decision_tree for help deciding the proper resolution

Open joe-jhou2 opened this issue 9 months ago • 13 comments

Description

Added cluster_resolution_finder and cluster_decision_tree as new tools and plotting functions, respectively, with corresponding tests. These functions enable hierarchical clustering analysis across multiple resolutions and visualization of cluster relationships as a decision tree.

This feature requires the igraph library for graph-based plotting in cluster_decision_tree and for Leiden clustering in cluster_resolution_finder. igraph is included in the leiden extra, which can be installed with:

pip install scanpy[leiden]
  • [x] Closes #3533
  • [x] Tests included or not required because:
  • [x] Release notes not necessary because:
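
For readers unfamiliar with clustering trees, here is a minimal sketch of the underlying idea, not the code in this PR. Given cluster labels for the same cells at two resolutions, each tree edge records what fraction of a child cluster's cells came from a given parent cluster. The helper name `cluster_tree_edges` and the toy labels are made up for illustration; in practice the labels would come from running Leiden at several resolutions, for example with `sc.tl.leiden(adata, resolution=r, key_added=...)`.

```python
from collections import Counter


def cluster_tree_edges(parent_labels, child_labels):
    """Map (parent, child) to the fraction of the child's cells found in parent."""
    if len(parent_labels) != len(child_labels):
        raise ValueError("label sequences must have one entry per cell")
    # Count cells for every (parent cluster, child cluster) combination.
    pair_counts = Counter(zip(parent_labels, child_labels))
    # Total size of each child cluster, used to normalise the counts.
    child_sizes = Counter(child_labels)
    return {
        (parent, child): n / child_sizes[child]
        for (parent, child), n in pair_counts.items()
    }


# Toy labels for six cells at a lower and a higher resolution (made-up data).
low_res = ["0", "0", "0", "0", "1", "1"]
high_res = ["0", "0", "2", "2", "2", "1"]
edges = cluster_tree_edges(low_res, high_res)
```

In this toy case, cluster "0" splits into "0" and "2", and cluster "2" also picks up one cell from cluster "1", so the edge weights show both where a split happens and how clean it is.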

joe-jhou2 avatar Mar 28 '25 05:03 joe-jhou2

Thanks, this looks nice! Do you need help with the tests?

flying-sheep avatar Mar 28 '25 09:03 flying-sheep

How does this implementation compare to https://github.com/complextissue/pyclustree ?

Zethson avatar Apr 11 '25 14:04 Zethson

That looks good! I hadn't come across it when I created this feature. Interestingly, we were both inspired by the R package clustree. It seems my function includes more features, most notably the ability to identify the genes driving the cluster splits in the tree. This significantly enhances the usefulness and interpretability of the splitting tree.

joe-jhou2 avatar Apr 11 '25 14:04 joe-jhou2

Hi @joe-jhou2, author of pyclustree here. Thanks for highlighting some of the differences in feature sets between the implementations. Whether one is “more” feature-rich than the other really depends on the intended use case and would not be the language I’d lean on, but I agree that the visualization of top genes differing between child clusters is a distinction and I am sure that it can potentially offer insights while saving an explicit call to rank_genes_groups for comparisons of interest if the focus is on the top differing genes. Regarding this PR, I noticed that the rank_genes_groups calls don’t currently expose user-configurable parameters. Allowing **kwargs to be passed through might be helpful for users who want to align this with other analyses, e.g., choosing different test statistics or multiple testing corrections.
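
A rough sketch of what such a pass-through could look like, under stated assumptions: `rank_split_genes` is a hypothetical wrapper name, the injectable `rank_fn` argument exists only so the sketch can run without scanpy, and the `"wilcoxon"` default is chosen for illustration. It is not the signature used in this PR.

```python
from __future__ import annotations

from typing import Any, Callable


def rank_split_genes(
    adata: Any,
    groupby: str,
    *,
    rank_fn: Callable[..., Any] | None = None,
    **rank_kwargs: Any,
) -> dict[str, Any]:
    """Rank genes between clusters, forwarding extra options to the ranking call."""
    if rank_fn is None:
        # Deferred import so this sketch loads even without scanpy installed.
        import scanpy as sc

        rank_fn = sc.tl.rank_genes_groups
    # Default assumed for this sketch; keywords supplied by the caller win.
    params = {"method": "wilcoxon", **rank_kwargs}
    rank_fn(adata, groupby, **params)
    # Return the effective parameters so callers can record what was used.
    return params
```

With a wrapper like this, a call such as `rank_split_genes(adata, "leiden", method="t-test", corr_method="bonferroni")` would reach `rank_genes_groups` with the user's choice of test and multiple-testing correction.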

maltekuehl avatar Apr 11 '25 16:04 maltekuehl

Hi @maltekuehl, It was a bit of virtual handshaking — I agree, there’s really no direct comparison in terms of which implementation has more features. We just happened to converge on the same idea, fortunately.

My initial idea/intention was very straightforward, aimed at providing a quick and intuitive visualization to help answer questions from bench scientists, like why we chose resolution A over B.

Your package is much more sophisticated and thoughtfully designed. I really appreciate your suggestion — I’ll make sure to include **kwargs.

joe-jhou2 avatar Apr 11 '25 17:04 joe-jhou2

OK, so how about you two collaborate and make the package the best it can be (i.e. implement the features this PR has there), and we advertise its use in scanpy tutorials?

That way I don’t have to maintain it, but more people get to see it?

flying-sheep avatar Apr 13 '25 09:04 flying-sheep

From the pyclustree side, we would definitely be open to that! For some context, pyclustree is part of a broader effort we're working on, but we decided to release this component early since we thought it could already be helpful to others. One important difference is that the implementation in this PR uses igraph (with networkx as a fallback), while in pyclustree we've intentionally kept dependencies light by only using networkx for layout. This does come at the cost of some of the visual niceties, like the Bezier curves shown in this PR, which we agree look great (and if scanpy is going to depend on igraph in a future release for Leiden clustering, it might not matter, though networkx potentially also offers interactivity via holoviews which we are currently evaluating). That said, the cluster_resolution_finder and find_cluster_specific_genes functions would be fantastic additions to the plotting tools in pyclustree, and it would be great to incorporate them. We’re completely aligned on the idea that having a single, well-maintained reference for clustering tree functionality within the scverse ecosystem would be best for the community. We're definitely open to finding common ground on the implementation side and happy to help maintain a unified solution in the long run.

EDIT: @joe-jhou2, if you would like to connect (e.g., schedule a meeting), feel free to reach out via mail to malte [dot] kuehl [at] clin [dot] au [dot] dk. We are on Central European time.

maltekuehl avatar Apr 14 '25 09:04 maltekuehl

> if scanpy is going to depend on igraph in a future release for Leiden clustering, it might not matter

FYI: We won’t ever hard-depend on GPL libraries, no matter if we recommend them. If we did, scanpy could effectively not be used under any license terms other than the GPL.

> networkx potentially also offers interactivity via holoviews which we are currently evaluating

I’m almost completely sure that scanpy will switch to holoviews in 2025; it offers too much for us to stick with matplotlib in the long term.

Some tools that can’t be supported by holoviews for now might stay in matplotlib for the time being though.

> We're definitely open to finding common ground on the implementation side and happy to help maintain a unified solution in the long run.

that’d be lovely!

flying-sheep avatar Apr 14 '25 12:04 flying-sheep

Thank you both for recognizing my contribution and showing interest in the visualization approach. I believe the most straightforward way to contribute to pyclustree would be to add find_cluster_specific_genes without modifying pyclustree's core structure, as it nicely complements the cluster tree splitting logic.

That said, I’ll be stepping away for the next six months due to a major life event and won’t be able to commit to further development during that time. For that reason, I’d really appreciate getting this PR merged so I can wrap up the task before my break.

@maltekuehl, I’d love to reach out to you later to set up a call and explore the possibility of collaborating, or at the very least see whether find_cluster_specific_genes could be picked up and adapted into the package if it fits.

joe-jhou2 avatar Apr 15 '25 22:04 joe-jhou2

@flying-sheep I refactored the cluster_decision_tree function into a class, but now I'm walking a tightrope: I need to import networkx, ruff doesn't allow it in the type checking block, and the deferred import test won't let me import it any earlier. Can you advise how to deal with this dilemma? Thanks.

joe-jhou2 avatar Apr 17 '25 22:04 joe-jhou2

Sure! We need to import it both inside every function that uses it and in the TYPE_CHECKING block.

I did that for you: 44cc72fd378e81b50749a838e32695b6b30899b3
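
For anyone hitting the same conflict, the pattern can be sketched roughly as follows. This is an illustration, not the code from the commit: the stdlib `fractions` module stands in for networkx so the snippet runs anywhere, and `half` is a made-up function.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; this branch is never executed at runtime.
    from fractions import Fraction  # stand-in for a heavy optional dependency


def half(n: int) -> Fraction:
    # Runtime import, deferred until the function is actually called.
    from fractions import Fraction

    return Fraction(n, 2)
```

Because of `from __future__ import annotations`, the return annotation is never evaluated at import time, so the module loads without importing the dependency, while type checkers still resolve the name through the TYPE_CHECKING block.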

flying-sheep avatar Apr 22 '25 12:04 flying-sheep

It passes almost all checks but fails the milestone label check. Can you advise on that? Thanks.

joe-jhou2 avatar Apr 22 '25 16:04 joe-jhou2

I thought the way forward was to have your contributions end up in pyclustree and have us refer people to that.

Things will probably be better maintained there, since I’m planning to phase out matplotlib code in scanpy and instead go all-in with holoviews.


Regarding the release notes, see https://scanpy.readthedocs.io/en/stable/dev/documentation.html#adding-to-the-docs

hatch run towncrier:create 3532.feature.md

Please add your name in the same way as other people added theirs in other news fragment files next to yours.

flying-sheep avatar May 16 '25 12:05 flying-sheep