Add function for calculating niches
Description
Adds a function that calculates niches using different strategies. The initial function calculates niches based on neighborhood profiles, similar to the approach described here.
This PR will get updated with methods discussed in https://github.com/scverse/squidpy/issues/789.
Codecov Report
Attention: Patch coverage is 17.71429% with 144 lines in your changes missing coverage. Please review.
Project coverage is 68.30%. Comparing base (df8e042) to head (bd16795). Report is 8 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #831 +/- ##
==========================================
- Coverage 69.99% 68.30% -1.69%
==========================================
Files 39 40 +1
Lines 5525 5707 +182
Branches 1029 1081 +52
==========================================
+ Hits 3867 3898 +31
- Misses 1363 1514 +151
Partials 295 295
| Files | Coverage Δ | |
|---|---|---|
| src/squidpy/gr/_niche.py | 17.71% <17.71%> (ø) |
Hi @LLehner, I was just wondering what the status of this is? I see a lot of conflicts and tests still missing. No rush, just curious if there is a timeline.
@giovp I want to wrap it up this month. Tests will be added and, if possible, 1-2 more methods; the existing ones will be compared to their original implementations.
@giovp @timtreis PR is ready for review!
Some additional questions that came up, which you could have a look at:
- How should multiple slides be dealt with? All these methods should work with multiple slides, but I'm still not sure how the adjacency matrix changes when you run `sq.gr.spatial_neighbors()` on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.
- I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?
- If you run `sq.calculate_niche()` more than once with `flavor="neighborhood"`, I noticed that the first function call takes as much time as you would expect, but subsequent runs are much (~10x) faster (e.g. first you run the method with one cluster resolution, then call the method again with some other cluster resolutions). It's almost as if the subsequent runs don't do the neighborhood calculations anymore (referring to `sc.pp.neighbors` here), but that shouldn't be the case, as nothing is cached and calculations happen on a new AnnData object for every function call. The "issue" also persists even if you change the count matrix shape (e.g. by masking data).
- Would you include some form of logging or verbosity so that the user sees what the method is currently doing? Depending on the data and settings, the function can take a while.
Hey @LLehner
How should multiple slides be dealt with? All these methods should work with multiple slides, but I'm still not sure how the adjacency matrix changes when you run `sq.gr.spatial_neighbors()` on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.
You mean a scenario where the user is just storing multiple slides in the same sdata object, like the `point8`, `point16`, `point24` dataset we have? These should be fully independent since they have no biological connection.
I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?
Do you have some rough numbers? We could give the user the option to define `n_cores` or similar and then use the parallel processing we already have, but we should definitely keep the option to run single-threaded so that it can be used inside tools like snakemake that take care of job distribution across cores. Otherwise, that'd cause errors the user cannot circumvent.
If you run `sq.calculate_niche()` more than once with `flavor="neighborhood"`, I noticed that the first function call takes as much time as you would expect, but subsequent runs are much (~10x) faster (e.g. first you run the method with one cluster resolution, then call the method again with some other cluster resolutions). It's almost as if the subsequent runs don't do the neighborhood calculations anymore (referring to `sc.pp.neighbors` here), but that shouldn't be the case, as nothing is cached and calculations happen on a new AnnData object for every function call. The "issue" also persists even if you change the count matrix shape (e.g. by masking data).
Could it be that the OS has some of the compiled bytecode or data in cache? 🤔
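One way to narrow this down is to time each call explicitly on a freshly constructed input; if the speed-up survives that, it likely comes from below the Python level (e.g. numba JIT compilation on the first call inside the nearest-neighbor backend, or warmed OS caches). A minimal timing pattern, with a hypothetical `run_once` standing in for the real call:

```python
# Sketch: time repeated calls on fresh inputs to check whether the speed-up
# comes from our own code path or from something lower down (JIT compilation,
# thread-pool warm-up, OS caches). `run_once` is a placeholder, not squidpy API.
import time

def run_once(data):
    return sum(x * x for x in data)  # placeholder workload

timings = []
for _ in range(3):
    data = list(range(100_000))  # fresh object each iteration, nothing reused
    t0 = time.perf_counter()
    run_once(data)
    timings.append(time.perf_counter() - t0)

# If later runs are still ~10x faster despite fresh inputs, the cause sits
# below this code (e.g. first-call JIT compilation), not in our caching.
```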
Would you include some form of logging or verbosity so that the user sees what the method is currently doing? Depending on the data and settings, the function can take a while.
What runtime are we talking about here? I tend to be a fan of some intermediate output (as long as one can also silence it, e.g. via a verbosity level).
also on this question
How should multiple slides be dealt with? All these methods should work with multiple slides, but I'm still not sure how the adjacency matrix changes when you run `sq.gr.spatial_neighbors()` on data with multiple slides. Is it a block diagonal matrix where each block on the diagonal is the adjacency matrix for a single slide? If yes, does it suffice to calculate the neighborhood graph once for all data or should it be done on individual slides? I noticed this can change niche results.
Yes, and yes. If neighbor calculation is run with multiple slides, a block diagonal matrix of the spatial neighbors graph is returned, in effect treating the slides as independent.
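For intuition, the block-diagonal structure can be sketched with a small stand-in (plain nested lists here; the real graph that `sq.gr.spatial_neighbors()` writes to `adata.obsp["spatial_connectivities"]` is a sparse matrix):

```python
# Sketch: how per-slide adjacency matrices combine into one block-diagonal
# matrix when neighbors are computed across multiple slides. Dense toy
# version for illustration only.

def block_diag(blocks):
    """Place each square adjacency block on the diagonal of a larger matrix."""
    n = sum(len(b) for b in blocks)
    out = [[0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        k = len(b)
        for i in range(k):
            for j in range(k):
                out[offset + i][offset + j] = b[i][j]
        offset += k
    return out

# Two hypothetical slides with 2 and 3 cells respectively.
slide_a = [[0, 1], [1, 0]]
slide_b = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
combined = block_diag([slide_a, slide_b])

# Off-diagonal blocks are all zero: cells on different slides share no edges,
# so computing the graph once over all data is equivalent to per-slide graphs.
```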
I think some parts could be sped up by parallelization (e.g. clustering with multiple resolutions), what would you recommend there?
Do you have some rough numbers? We could give the user the option to define `n_cores` or similar and then use the parallel processing we already have, but we should definitely keep the option to run single-threaded so that it can be used inside tools like snakemake that take care of job distribution across cores. Otherwise, that'd cause errors the user cannot circumvent.
yes agree! You can take a look at the Parallel module we have now and just reuse it
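As a rough sketch of the single-threaded-by-default pattern discussed above (using stdlib `concurrent.futures` as a generic stand-in for the project's own `Parallel` module; `cluster_at_resolution` is a hypothetical placeholder, not squidpy API):

```python
# Sketch: fan clustering at multiple resolutions out over processes, while
# keeping n_jobs=1 as a strictly sequential path so workflow managers like
# snakemake can control core allocation themselves.
from concurrent.futures import ProcessPoolExecutor

def cluster_at_resolution(resolution):
    # Placeholder for e.g. a leiden clustering call at this resolution.
    return {"resolution": resolution, "n_clusters": int(10 * resolution)}

def cluster_many(resolutions, n_jobs=1):
    if n_jobs == 1:
        # No pool at all: avoids process-spawning side effects entirely.
        return [cluster_at_resolution(r) for r in resolutions]
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(cluster_at_resolution, resolutions))

results = cluster_many([0.5, 1.0, 1.5], n_jobs=1)
```

Each resolution is independent, so this parallelizes cleanly; the worker function just has to be picklable for the multi-process path.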
Would you include some form of logging or verbosity so that the user sees what the method is currently doing? Depending on the data and settings, the function can take a while.
Absolutely, please log everything you see fit!