Add function for calculating niches
Description
Adds a function that calculates niches using different strategies. The initial function calculates niches based on neighborhood profiles, similar to the approach described here.
This PR will get updated with methods discussed in https://github.com/scverse/squidpy/issues/789.
Codecov Report
Attention: Patch coverage is 17.71429% with 144 lines in your changes missing coverage. Please review.
Project coverage is 68.30%. Comparing base (df8e042) to head (bd16795). Report is 8 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #831 +/- ##
==========================================
- Coverage 69.99% 68.30% -1.69%
==========================================
Files 39 40 +1
Lines 5525 5707 +182
Branches 1029 1081 +52
==========================================
+ Hits 3867 3898 +31
- Misses 1363 1514 +151
Partials 295 295
| Files | Coverage Δ | |
|---|---|---|
| src/squidpy/gr/_niche.py | 17.71% <17.71%> (ø) |
Hi @LLehner, I was just wondering what the status of this is? I see a lot of conflicts and tests still missing. No rush, just curious if there is a timeline.
@giovp I want to wrap it up this month. Tests will be added and, if possible, 1-2 more methods; the existing ones will be compared to their original implementations.
@giovp @timtreis PR is ready for review!
Some additional questions that came up, which you could have a look at:
- How should multiple slides be dealt with? All these methods should work with multiple slides, but I'm still not sure how the adjacency matrix changes when you run `sq.gr.spatial_neighbors()` on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.
- I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?
- If you run `sq.calculate_niche()` more than once with `flavor="neighborhood"`, I noticed that the first function call takes as much time as you would expect, but subsequent runs are much (~10x) faster (e.g. first you run the method with one cluster resolution, then call the method again with some other cluster resolutions). It's almost as if the subsequent runs don't do the neighborhood calculations anymore (referring to `sc.pp.neighbors` here), but that shouldn't be the case, as nothing is cached and calculations happen on a new AnnData object for every function call. The "issue" also persists even if you change the count matrix shape (e.g. by masking data).
- Would you include some form of logging or verbosity so that the user sees what the method is currently doing? Depending on the data and settings, the function can take a while.
Hey @LLehner
How should multiple slides be dealt with? All these methods should work with multiple slides, but I'm still not sure how the adjacency matrix changes when you run `sq.gr.spatial_neighbors()` on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.
You mean a scenario where the user is just storing multiple slides in the same sdata object, like the `point8`, `point16`, `point24` dataset we have? These should be fully independent since they have no biological connection.
I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?
Do you have some rough numbers? We could give the user the option to define `n_cores` or similar and then use the parallel processing we already have, but we should definitely keep the option to run single-threaded so that it can be used inside tools like snakemake that take care of job distribution across cores. Otherwise, that'd cause errors the user cannot circumvent.
If you run `sq.calculate_niche()` more than once with `flavor="neighborhood"`, I noticed that the first function call takes as much time as you would expect, but subsequent runs are much (~10x) faster (e.g. first you run the method with one cluster resolution, then call the method again with some other cluster resolutions). It's almost as if the subsequent runs don't do the neighborhood calculations anymore (referring to `sc.pp.neighbors` here), but that shouldn't be the case, as nothing is cached and calculations happen on a new AnnData object for every function call. The "issue" also persists even if you change the count matrix shape (e.g. by masking data).
Could it be that the OS has some of the compiled bytecode or data in cache? 🤔
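One way to narrow this down is to time each call explicitly on a freshly constructed input; if the speed-up survives that, it likely comes from below the Python level (e.g. numba JIT compilation on the first call inside the nearest-neighbor backend, or warmed OS caches). A minimal timing pattern, with a hypothetical `run_once` standing in for the real call:

```python
# Sketch: time repeated calls on fresh inputs to check whether the speed-up
# comes from our own code path or from something lower down (JIT compilation,
# thread-pool warm-up, OS caches). `run_once` is a placeholder, not squidpy API.
import time

def run_once(data):
    return sum(x * x for x in data)  # placeholder workload

timings = []
for _ in range(3):
    data = list(range(100_000))  # fresh object each iteration, nothing reused
    t0 = time.perf_counter()
    run_once(data)
    timings.append(time.perf_counter() - t0)

# If later runs are still ~10x faster despite fresh inputs, the cause sits
# below this code (e.g. first-call JIT compilation), not in our caching.
```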
Would you include some form of logging or verbosity so that the user sees what the method is currently doing? Depending on the data and settings, the function can take a while.
What runtime are we talking about here? I tend to be a fan of some intermediate output (as long as one can also silence it, e.g. via a verbosity level).
also on this question
How should multiple slides be dealt with? All these methods should work with multiple slides, but I'm still not sure how the adjacency matrix changes when you run `sq.gr.spatial_neighbors()` on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.
Yes, and yes. If neighbor calculation is run with multiple slides, a block diagonal matrix of the spatial neighbors graph is returned, in effect treating the slides as independent.
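For intuition, the block-diagonal structure can be sketched with a small stand-in (plain nested lists here; the real graph that `sq.gr.spatial_neighbors()` writes to `adata.obsp["spatial_connectivities"]` is a sparse matrix):

```python
# Sketch: how per-slide adjacency matrices combine into one block-diagonal
# matrix when neighbors are computed across multiple slides. Dense toy
# version for illustration only.

def block_diag(blocks):
    """Place each square adjacency block on the diagonal of a larger matrix."""
    n = sum(len(b) for b in blocks)
    out = [[0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        k = len(b)
        for i in range(k):
            for j in range(k):
                out[offset + i][offset + j] = b[i][j]
        offset += k
    return out

# Two hypothetical slides with 2 and 3 cells respectively.
slide_a = [[0, 1], [1, 0]]
slide_b = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
combined = block_diag([slide_a, slide_b])

# Off-diagonal blocks are all zero: cells on different slides share no edges,
# so computing the graph once over all data is equivalent to per-slide graphs.
```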
I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?
Do you have some rough numbers? We could give the user the option to define `n_cores` or similar and then use the parallel processing we already have, but we should definitely keep the option to run single-threaded so that it can be used inside tools like snakemake that take care of job distribution across cores. Otherwise, that'd cause errors the user cannot circumvent.
yes agree! You can take a look at the Parallel module we have now and just reuse it
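As a rough sketch of the single-threaded-by-default pattern discussed above (using stdlib `concurrent.futures` as a generic stand-in for the project's own `Parallel` module; `cluster_at_resolution` is a hypothetical placeholder, not squidpy API):

```python
# Sketch: fan clustering at multiple resolutions out over processes, while
# keeping n_jobs=1 as a strictly sequential path so workflow managers like
# snakemake can control core allocation themselves.
from concurrent.futures import ProcessPoolExecutor

def cluster_at_resolution(resolution):
    # Placeholder for e.g. a leiden clustering call at this resolution.
    return {"resolution": resolution, "n_clusters": int(10 * resolution)}

def cluster_many(resolutions, n_jobs=1):
    if n_jobs == 1:
        # No pool at all: avoids process-spawning side effects entirely.
        return [cluster_at_resolution(r) for r in resolutions]
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(cluster_at_resolution, resolutions))

results = cluster_many([0.5, 1.0, 1.5], n_jobs=1)
```

Each resolution is independent, so this parallelizes cleanly; the worker function just has to be picklable for the multi-process path.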
Would you include some form of logging or verbosity so that the user sees what the method is currently doing? Depending on the data and settings, the function can take a while.
Absolutely, please log everything you see fit!