squidpy
Add method to calculate embeddings for variable by distance aggregation
Description
Adds a method in tools to calculate embeddings of variables by their counts aggregated by distance.
Example usage
import squidpy as sq
# load example data set
adata = sq.datasets.seqfish()
Calculate distances of each observation to a specified anchor point (e.g. cell type or tissue location). Here we use cell type "Endothelium" in the annotation column "celltype_mapped_refined":
sq.tl.var_by_distance(adata, groups="Endothelium", cluster_key="celltype_mapped_refined")
The resulting distances are stored in adata.obsm["design_matrix"]. Now we can calculate the embeddings, which are returned as a new anndata object:
adata_new = sq.tl.var_embeddings(adata, group="Endothelium", design_matrix_key="design_matrix")
Note that by default the bin of distance 0, meaning the counts that belong to the anchor point, is excluded. This can be changed by setting include_anchor=True in sq.tl.var_embeddings().
adata_new.X contains the aggregated var x distance_bin count matrix. adata_new.obs contains the variables as a categorical matrix, which is required to highlight them in plots.
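To make the aggregation step concrete, here is a minimal, dependency-free sketch of the idea. It is not the code in this PR: the helper names `distance_to_anchor` and `aggregate_by_distance`, the `bin_width` parameter, and the binning rule (bin 0 holds only observations at distance exactly 0, and every other distance is rounded up to the next multiple of `bin_width`) are assumptions made for illustration, and plain lists stand in for the anndata object.

```python
import math


def distance_to_anchor(coords, labels, anchor_label):
    """Euclidean distance from each observation to its nearest anchor observation."""
    anchors = [xy for xy, lab in zip(coords, labels) if lab == anchor_label]
    if not anchors:
        raise ValueError(f"no observations labelled {anchor_label!r}")
    return [min(math.dist(xy, a) for a in anchors) for xy in coords]


def aggregate_by_distance(counts, distances, bin_width, include_anchor=False):
    """Sum an obs x var count table into a var x distance_bin table.

    Assumed binning rule: bin 0 is the anchor itself (distance == 0),
    bin k > 0 covers distances in ((k - 1) * bin_width, k * bin_width].
    """
    n_vars = len(counts[0])
    sums = {}
    for row, d in zip(counts, distances):
        b = 0 if d == 0 else math.ceil(d / bin_width)
        if b == 0 and not include_anchor:
            continue  # mirror the default of dropping the anchor bin
        acc = sums.setdefault(b, [0] * n_vars)
        for v, c in enumerate(row):
            acc[v] += c
    bins = sorted(sums)
    matrix = [[sums[b][v] for b in bins] for v in range(n_vars)]
    return bins, matrix


# Toy data: five observations, two variables, two "Endothelium" anchors.
coords = [(0, 0), (1, 0), (2, 0), (0, 3), (5, 0)]
labels = ["Endothelium", "A", "A", "B", "Endothelium"]
counts = [[5, 0], [1, 2], [3, 1], [0, 4], [2, 2]]

distances = distance_to_anchor(coords, labels, "Endothelium")
bins, matrix = aggregate_by_distance(counts, distances, bin_width=2)
print(bins, matrix)  # [1, 2] [[4, 0], [3, 4]]
```

Each row of `matrix` is one variable's count profile over increasing distance from the anchor. A table of this var x distance_bin shape is what would then be passed to a dimensionality reduction to compare variables by their spatial pattern.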
TODO
- [ ] Add a plotting function so this doesn't need to be done manually.
- [ ] Allow flexible embedding calculations
Codecov Report
Attention: Patch coverage is 33.33333% with 24 lines in your changes missing coverage. Please review.
Project coverage is 69.75%. Comparing base (df8e042) to head (8ee07ba).
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #807      +/-   ##
==========================================
- Coverage   69.99%   69.75%   -0.24%
==========================================
  Files          39       40       +1
  Lines        5525     5561      +36
  Branches     1029     1037       +8
==========================================
+ Hits         3867     3879      +12
- Misses       1363     1387      +24
  Partials      295      295
Files | Coverage Δ | |
---|---|---|
src/squidpy/tl/_var_embeddings.py | 33.33% <33.33%> (ø) |
Hi @LLehner, thank you for this. Would you mind elaborating a bit on when this would be used? Also, what if the embeddings are pre-calculated, or the user would like to use something other than UMAP; should that be an option? Finally, I think a test would be required before we get this in, thanks!
Hey @giovp, this feature came out of a discussion with @maiiashulman. We ran into a situation in which the "literature-curated" signature for hypoxia was either 20 or 4000 genes, the latter obviously being useless. So we wondered which other genes might show the same spatially variable pattern as a function of distance to a certain cell type (e.g. epithelial). This is essentially a graphical method to see whether a given set of genes (e.g. the 20-gene signature) even varies in a similar pattern.
But I agree with your points; if we see that it's actually doing something useful, we should make it a bit more flexible.
@timtreis this function now returns an anndata object, which I think simplifies further processing compared to storing the new count matrix somewhere in .varm or .uns. If we want to make use of the dimensionality reduction and clustering methods already implemented in scanpy, the count matrix needs to be in .X, and for visualization we need the variable names stored as categories in .obs. Doing all of this in the same anndata would just make things cluttered.
Additionally, the question is whether a spatialdata object should be required as input instead of an anndata one, because then a new table could be added directly instead of having multiple disconnected tables.
The function call would change from:
adata_new = sq.tl.var_embeddings(adata, group="Endothelium", design_matrix_key="design_matrix")
to
sq.tl.var_embeddings(sdata, group="Endothelium", design_matrix_key="design_matrix")