pertpy

possible to compute distances on a subset of genes?

Open aterceros opened this issue 1 year ago • 3 comments

Description of feature

Hi! Thank you for making this package available! I was wondering if it is possible to compute distances between groups of cells for a subset of genes (for example differentially expressed between 2 groups)? Thanks in advance.

aterceros, Aug 14 '24 15:08

Great suggestion! We usually calculate most distances in lower-dimensional spaces (such as PCA), since distances in high-dimensional spaces tend to be less informative. Depending on how large your set of genes is, you can either:

  • (If many genes and partially redundant) Calculate PCA on a subset of genes, then use that subset PCA for calculating distances. You can use the mask_var argument in scanpy.pp.pca for this.
  • (If few genes) Directly calculate distances on the subset. In this case I would just store your subset with adata.obsm['X_subset'] = adata[:, gene_subset].X.copy(), then specify pt.tl.Distance(metric="euclidean", obsm_key="X_subset").
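A rough sketch of the second option in plain NumPy, so it runs without an AnnData object. The toy matrix, gene names, condition labels, and the `centroid_distance` helper are all hypothetical stand-ins for illustration; in practice the subset matrix would be stored in adata.obsm['X_subset'] and the distance computed by pt.tl.Distance as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for adata.X, adata.var_names and adata.obs["condition"] (hypothetical data).
var_names = np.array([f"gene{i}" for i in range(50)])
X_full = rng.normal(size=(200, 50))
condition = np.array(["A"] * 100 + ["B"] * 100)
X_full[condition == "B", :5] += 2.0  # shift the first five genes in group B

# The genes of interest, e.g. differentially expressed genes.
gene_subset = ["gene0", "gene1", "gene2", "gene3", "gene4"]

# Select the subset columns; this is the matrix that would go into adata.obsm["X_subset"].
cols = np.flatnonzero(np.isin(var_names, gene_subset))
X_subset = X_full[:, cols]


def centroid_distance(X, Y):
    """Euclidean distance between group means (a simple stand-in metric)."""
    return float(np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0)))


d_subset = centroid_distance(X_subset[condition == "A"], X_subset[condition == "B"])
d_full = centroid_distance(X_full[condition == "A"], X_full[condition == "B"])
print(f"subset: {d_subset:.2f}, all genes: {d_full:.2f}")
```

Restricting to the subset drops the contribution of the remaining genes, which here carry only noise, so the subset distance reflects the shifted genes alone.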

@Zethson since our distance function is already flexible enough to handle this case by specifying a different key in obsm, I think there is no need to implement this feature here directly. We could add a small example of this to the docs, though, because this approach is quite useful for analysis.

stefanpeidli, Oct 21 '24 09:10

Thank you for the comment! I'll try the second option!

aterceros, Oct 30 '24 15:10

Hi! Thank you for the suggestion above. I tried the second option and it seems to work well. However, when I run the bootstrap option, I get very large variances (i.e. between 120 and 160), but only for some comparisons. Would you say that such large variance values can occur?

What I ran:

    adata.obsm['X_subset'] = adata[:, geneset].X.copy()
    distance = pt.tl.Distance(metric="wasserstein", obsm_key="X_subset")
    X = adata.obsm["X_subset"][adata.obs["condition"] == "A"]
    Y = adata.obsm["X_subset"][adata.obs["condition"] == "B"]
    D = distance.bootstrap(X, Y)

  • my gene subsets are ~ 100 genes (DEGs).

Thank you!

aterceros, Dec 11 '24 00:12

@stefanpeidli do you have a comment, please?

Zethson, Jun 03 '25 13:06

Ah sorry for overlooking this! Thanks @Zethson for pinging me!

Honestly, I have never calculated the variance with the Wasserstein distance, so I do not know what sensible values are in this case. That said, values above 100 sound like a scale issue. Since you are calculating distances directly on gene expression, the values can get really large. I recommend z-score normalizing your data before calculating distances.

For context: when e.g. calculating distances on PCA space, the data is scaled implicitly by PCA so we do not observe this issue.
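To illustrate the scale argument, here is a minimal NumPy sketch under some loud assumptions: the data are simulated, the `zscore`, `centroid_distance`, and `bootstrap_distance` helpers are hypothetical, and the metric is a simple distance between group means rather than the Wasserstein distance or pertpy's own bootstrap. The point it shows is general, though: multiplying the data by a factor c multiplies such a distance by c and its bootstrap variance by c², so unscaled genes with large values can dominate the variance.

```python
import numpy as np


def zscore(M):
    """Z-score each column (gene) across all cells; constant columns stay at zero."""
    mu = M.mean(axis=0)
    sd = M.std(axis=0)
    sd[sd == 0] = 1.0
    return (M - mu) / sd


def centroid_distance(X, Y):
    """Euclidean distance between group means (a simple stand-in metric)."""
    return float(np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0)))


def bootstrap_distance(X, Y, n_bootstrap=200, seed=0):
    """Resample cells with replacement; return mean and variance of the distance."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_bootstrap):
        xb = X[rng.integers(0, len(X), len(X))]
        yb = Y[rng.integers(0, len(Y), len(Y))]
        vals.append(centroid_distance(xb, yb))
    vals = np.asarray(vals)
    return vals.mean(), vals.var()


# Simulated data: 100 genes whose scales differ widely (hypothetical numbers).
rng = np.random.default_rng(1)
n_cells, n_genes = 150, 100
gene_scale = rng.uniform(1.0, 30.0, size=n_genes)
A = rng.normal(size=(n_cells, n_genes)) * gene_scale
B = (rng.normal(size=(n_cells, n_genes)) + 0.5) * gene_scale

raw_mean, raw_var = bootstrap_distance(A, B)

# Scale jointly over all cells first, then split back into the two groups.
Z = zscore(np.vstack([A, B]))
z_mean, z_var = bootstrap_distance(Z[:n_cells], Z[n_cells:])

print(f"raw:      mean={raw_mean:.2f}, var={raw_var:.3f}")
print(f"z-scored: mean={z_mean:.2f}, var={z_var:.3f}")
```

Note that the scaling is done over all cells together before splitting into groups; z-scoring each group separately would remove the very mean difference the distance is meant to capture.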

stefanpeidli, Jun 04 '25 12:06