squidpy
Add method to calculate embeddings for variable by distance aggregation
Description
Adds a method in tools to calculate embeddings of variables by their counts aggregated by distance.
Example usage
import squidpy as sq
Load the example dataset:
adata = sq.datasets.seqfish()
Calculate distances of each observation to a specified anchor point (e.g. cell type or tissue location). Here we use cell type "Endothelium" in the annotation column "celltype_mapped_refined":
sq.tl.var_by_distance(adata, groups="Endothelium", cluster_key="celltype_mapped_refined")
The resulting distances are stored in adata.obsm["design_matrix"]. Now we can calculate the embeddings, which are returned as a new anndata object:
adata_new = sq.tl.var_embeddings(adata, group="Endothelium", design_matrix_key="design_matrix")
Note that by default the bin at distance 0, i.e. the counts that belong to the anchor point itself, is excluded. This can be changed by setting include_anchor=True in sq.tl.var_embeddings().
adata_new.X contains the aggregated var x distance_bin count matrix. adata_new.obs contains the variables as a categorical matrix, which is required to highlight them in plots.
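To make the aggregation step more concrete, here is a minimal NumPy sketch of the idea: compute each observation's distance to its nearest anchor, bin those distances, and sum counts per bin into a var x distance_bin matrix. This is not the code in this PR. The helper names `distance_to_nearest_anchor` and `aggregate_by_distance_bin`, the equal-width binning, and the `n_bins` parameter are all assumptions made for illustration; only the `include_anchor` behaviour mirrors the description above.

```python
import numpy as np


def distance_to_nearest_anchor(coords, is_anchor):
    """Euclidean distance from every observation to its closest anchor observation.

    Hypothetical helper; the PR's own distance computation may differ.
    """
    coords = np.asarray(coords, dtype=float)
    anchors = coords[np.asarray(is_anchor, dtype=bool)]
    # Pairwise differences, shape (n_obs, n_anchors, n_dims).
    diff = coords[:, None, :] - anchors[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def aggregate_by_distance_bin(counts, distances, n_bins=5, include_anchor=False):
    """Sum an (n_obs, n_vars) count matrix per distance bin.

    Returns an (n_vars, n_bins) matrix, or (n_vars, n_bins + 1) when the
    anchor bin (distance exactly 0) is kept. Equal-width bins are an
    assumption of this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    distances = np.asarray(distances, dtype=float)
    edges = np.linspace(0.0, distances.max(), n_bins + 1)
    # Bins 1..n_bins cover (0, max]; bin 0 is reserved for the anchor itself.
    bin_idx = np.digitize(distances, edges[1:], right=True) + 1
    bin_idx[distances == 0] = 0
    agg = np.zeros((n_bins + 1, counts.shape[1]))
    np.add.at(agg, bin_idx, counts)  # unbuffered add handles repeated bins
    var_by_bin = agg.T
    return var_by_bin if include_anchor else var_by_bin[:, 1:]


# Toy data: four observations on a line, the first one is the anchor.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
is_anchor = np.array([True, False, False, False])
counts = np.array([[1, 0], [2, 1], [0, 3], [4, 4]])  # 4 observations x 2 variables

distances = distance_to_nearest_anchor(coords, is_anchor)
var_by_bin = aggregate_by_distance_bin(counts, distances, n_bins=2)
print(var_by_bin.shape)
```

With the default `include_anchor=False`, the first observation's counts are dropped, which corresponds to the exclusion of the distance-0 bin noted above.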
TODO
- [ ] Add a plotting function so this doesn't need to be done manually.
- [ ] Allow flexible embedding calculations
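As an illustration of what a more flexible embedding step could look like, the sketch below embeds a var x distance_bin matrix with plain PCA (via SVD) rather than a fixed method. It is only a sketch under stated assumptions: the function name `pca_embed` and its `n_components` parameter are hypothetical and not part of this PR, and PCA is swapped in here purely as an example of an alternative embedding.

```python
import numpy as np


def pca_embed(var_by_bin, n_components=2):
    """Project each variable's distance profile onto its top principal components.

    Hypothetical helper: rows are variables, columns are distance bins.
    """
    X = np.asarray(var_by_bin, dtype=float)
    X = X - X.mean(axis=0, keepdims=True)  # center each distance bin
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


# Toy profiles: the first two variables share a pattern, the third is reversed.
profiles = np.array([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [4, 3, 2, 1],
])
embedding = pca_embed(profiles)
print(embedding.shape)
```

Variables whose aggregated counts follow the same pattern over distance land on the same point, while a variable with a different pattern is separated from them.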
Codecov Report
Attention: Patch coverage is 33.33333%, with 24 lines in your changes missing coverage. Please review.
Project coverage is 69.75%. Comparing base (df8e042) to head (8ee07ba).
Additional details and impacted files
@@ Coverage Diff @@
## main #807 +/- ##
==========================================
- Coverage 69.99% 69.75% -0.24%
==========================================
Files 39 40 +1
Lines 5525 5561 +36
Branches 1029 1037 +8
==========================================
+ Hits 3867 3879 +12
- Misses 1363 1387 +24
Partials 295 295
| Files | Coverage Δ | |
|---|---|---|
| src/squidpy/tl/_var_embeddings.py | 33.33% <33.33%> (ø) | |
Hi @LLehner, thank you for this. Would you mind elaborating a bit on when this would be used? Also, what if the embeddings are pre-calculated, or the user would like to use something other than UMAP; should that be an option? Finally, I think a test would be required before we get this in, thanks!
Hey @giovp, this feature came out of a discussion with @maiiashulman. We ran into a situation in which the "literature-curated" signature for hypoxia was either 20 or 4000 genes, the latter obviously being useless. So we wondered which other genes might show the same spatially variable pattern as a function of distance to a certain cell type (e.g. epithelial). This is essentially a graphical method to see if a given set of genes (e.g. the 20-gene signature) even varies in a similar pattern.
But I agree with your points; if we see that it's actually doing something useful, we should make it a bit more flexible.
@timtreis this function now returns an anndata object, which I think simplifies further processing compared to storing the new count matrix somewhere in .varm or .uns. If we want to make use of the dimensionality reduction and clustering methods already implemented in scanpy, the count matrix needs to be in .X, and for visualization we need the variable names stored as categories in .obs. Doing all of this in the same anndata would just make things cluttered.
Additionally, the question is whether a spatialdata object should be required as input instead of an anndata one, because then a new table could be added directly instead of having multiple disconnected tables.
The function call would change from:
adata_new = sq.tl.var_embeddings(adata, group="Endothelium", design_matrix_key="design_matrix")
to
sq.tl.var_embeddings(sdata, group="Endothelium", design_matrix_key="design_matrix")