Add method to calculate embeddings for variable by distance aggregation

Open LLehner opened this issue 11 months ago • 4 comments

Description

Adds a method in tools to calculate embeddings of variables by their counts aggregated by distance.

Example usage

import squidpy as sq

# Load the example data set
adata = sq.datasets.seqfish()

# Calculate the distance of each observation to a specified anchor point (e.g. a cell type or tissue location).
# Here we use the cell type "Endothelium" in the annotation column "celltype_mapped_refined".
sq.tl.var_by_distance(adata, groups="Endothelium", cluster_key="celltype_mapped_refined")

# The resulting distances are stored in adata.obsm["design_matrix"].
# Now we can calculate the embeddings, which are returned as a new anndata object.
adata_new = sq.tl.var_embeddings(adata, group="Endothelium", design_matrix_key="design_matrix")

Note that by default the bin at distance 0, i.e. the counts that belong to the anchor point itself, is excluded. This can be changed by setting include_anchor=True in sq.tl.var_embeddings().

adata_new.X contains the aggregated var x distance_bin count matrix. adata_new.obs contains the variables as a categorical matrix, which is required to highlight them in plots.
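
To make the aggregation step concrete, here is a minimal, dependency-free sketch of what "counts aggregated by distance" could look like. It is not the code in this PR: the helper name aggregate_by_distance, the n_bins parameter, and the equal-width binning of distances are assumptions made for illustration. Only the var x distance_bin output shape and the include_anchor behaviour follow the description above.

```python
import math


def aggregate_by_distance(counts, distances, n_bins=2, include_anchor=False):
    """Sum per-variable counts over distance bins (hypothetical helper).

    counts: one row per observation, each row a list of per-variable counts.
    distances: one distance to the anchor per observation (0 = anchor itself).
    Returns a var x distance_bin matrix as a list of rows.
    """
    max_dist = max(distances)
    n_vars = len(counts[0])
    # Bin 0 holds the anchor observations; bins 1..n_bins split (0, max_dist]
    # into equal-width intervals. Equal-width binning is an assumption.
    totals = [[0] * (n_bins + 1) for _ in range(n_vars)]
    for row, dist in zip(counts, distances):
        bin_idx = 0 if dist == 0 else math.ceil(dist / max_dist * n_bins)
        for var_idx, value in enumerate(row):
            totals[var_idx][bin_idx] += value
    if include_anchor:
        return totals
    # Drop the distance-0 bin, mirroring the default described above.
    return [row[1:] for row in totals]


# Four observations, two variables; the first observation is the anchor.
counts = [[5, 0], [1, 2], [0, 3], [2, 2]]
distances = [0, 1, 2, 4]

print(aggregate_by_distance(counts, distances))
# [[1, 2], [5, 2]]
print(aggregate_by_distance(counts, distances, include_anchor=True))
# [[5, 1, 2], [0, 5, 2]]
```

Each row of the result corresponds to one variable and each column to one distance bin, which is the layout that ends up in adata_new.X.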

TODO

  • [ ] Add a plotting function so this doesn't need to be done manually.
  • [ ] Allow flexible embedding calculations

LLehner, Mar 04 '24 22:03

Codecov Report

Attention: Patch coverage is 33.33333%, with 24 lines in your changes missing coverage. Please review.

Project coverage is 69.75%. Comparing base (df8e042) to head (8ee07ba).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #807      +/-   ##
==========================================
- Coverage   69.99%   69.75%   -0.24%     
==========================================
  Files          39       40       +1     
  Lines        5525     5561      +36     
  Branches     1029     1037       +8     
==========================================
+ Hits         3867     3879      +12     
- Misses       1363     1387      +24     
  Partials      295      295              
Files                               Coverage Δ
src/squidpy/tl/_var_embeddings.py   33.33% <33.33%> (ø)

codecov-commenter, Mar 04 '24 23:03

Hi @LLehner, thank you for this. Would you mind elaborating a bit on when this would be used? Also, what if the embeddings are pre-calculated, or the user would like to use something other than UMAP? Should that be an option? Finally, I think a test would be required before we get this in, thanks!

giovp, Apr 22 '24 07:04

Hey @giovp, this feature came out of a discussion with @maiiashulman. We ran into a situation in which the "literature-curated" signature for hypoxia was either 20 or 4000 genes, the latter obviously being useless. So we wondered which other genes might show the same spatially variable pattern as a function of distance to a certain cell type (e.g. epithelial). This is essentially a graphical method to see whether a given set of genes (e.g. the 20-gene signature) even varies in a similar pattern.

But I agree with your points; if we see that it's actually doing something useful, we should make it a bit more flexible.

timtreis, Apr 22 '24 17:04

@timtreis this function now returns an anndata object, which I think simplifies further processing compared to storing the new count matrix somewhere in .varm or .uns. If we want to make use of the dimensionality reduction and clustering methods already implemented in scanpy, the count matrix needs to be in .X, and for visualization we need the variable names stored as categories in .obs. Doing all of this in the same anndata would just make things cluttered.

Additionally, the question is whether a spatialdata object should be required as input instead of an anndata one, because then a new table could be added directly instead of having multiple disconnected tables.

The function call would change from:
adata_new = sq.tl.var_embeddings(adata, group="Endothelium", design_matrix_key="design_matrix")
to:
sq.tl.var_embeddings(sdata, group="Endothelium", design_matrix_key="design_matrix")

LLehner, Aug 08 '24 11:08