pertpy: possible to compute distances on a subset of genes?

Description of feature

Hi! Thank you for making this package available! I was wondering if it is possible to compute distances between groups of cells for a subset of genes (for example differentially expressed between 2 groups)? Thanks in advance.

Aug 14 '24 15:08 aterceros

Great suggestion! We usually calculate most distances in lower-dimensional spaces (such as PCA), since distances in high-dimensional spaces tend to become less informative. Depending on how large your set of genes is, you can either:

(If many genes and partially redundant) Calculate PCA on a subset of genes, then use that subset PCA for calculating distances. You can use the mask_var argument in scanpy.pp.pca for this.
(If few genes) Directly calculate distances on the subset. In this case I would just store your subset with adata.obsm['X_subset'] = adata[:, gene_subset].X.copy(), then specify pt.tl.Distance(metric="euclidean", obsm_key="X_subset").
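
To make the two options concrete, here is a minimal sketch in plain NumPy on toy data. Everything in it is an assumption for illustration: the matrix X stands in for adata.X, the gene names and condition labels are made up, and mean_euclidean is a hypothetical helper (distance between group means), not pertpy's implementation. With a real AnnData object you would store the resulting arrays in adata.obsm and pass the key via obsm_key as described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for adata.X: 200 cells x 50 genes (hypothetical values).
n_cells, n_genes = 200, 50
X = rng.normal(size=(n_cells, n_genes))
gene_names = np.array([f"gene{i}" for i in range(n_genes)])
condition = np.array(["A"] * 100 + ["B"] * 100)

# Shift a few genes in condition B so the two groups actually differ.
gene_subset = ["gene3", "gene7", "gene11"]
subset_mask = np.isin(gene_names, gene_subset)
X[np.ix_(condition == "B", subset_mask)] += 2.0


def mean_euclidean(a, b):
    """Euclidean distance between group means (simple stand-in metric)."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


# Few genes: use the subset columns directly.
X_subset = X[:, subset_mask]
d_subset = mean_euclidean(X_subset[condition == "A"], X_subset[condition == "B"])

# Many genes: PCA restricted to the subset (here via SVD on centered data),
# then distances in the reduced space.
centered = X_subset - X_subset.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
n_comps = 2
X_pca_subset = centered @ Vt[:n_comps].T
d_pca = mean_euclidean(X_pca_subset[condition == "A"], X_pca_subset[condition == "B"])

# For comparison: the same metric on all genes.
d_all = mean_euclidean(X[condition == "A"], X[condition == "B"])

print(f"subset: {d_subset:.2f}, subset PCA: {d_pca:.2f}, all genes: {d_all:.2f}")
```

The subset PCA distance can never exceed the direct subset distance here, because projecting onto fewer components can only shrink the difference between the group means.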

@Zethson since our distance function is already flexible enough to handle this case by specifying a different key in obsm, I think there is no need to implement this feature here directly. We could add a small example on this to the docs, though, because this approach is quite useful for analysis.

Oct 21 '24 09:10 stefanpeidli

Thank you for the comment! I'll try the second option!

Oct 30 '24 15:10 aterceros

Hi! Thank you for the suggestion above. I tried the second option and it seems to work well. However, when I run the bootstrap option, I get very large variances (i.e. between 120 and 160) for some comparisons only. Would you say that such large variance values can occur?

What I ran:

adata.obsm['X_subset'] = adata[:, geneset].X.copy()
distance = pt.tl.Distance(metric="wasserstein", obsm_key="X_subset")
X = adata.obsm["X_subset"][adata.obs["condition"] == "A"]
Y = adata.obsm["X_subset"][adata.obs["condition"] == "B"]
D = distance.bootstrap(X, Y)

My gene subsets contain ~100 genes (DEGs).

Thank you!

Dec 11 '24 00:12 aterceros

@stefanpeidli do you have a comment, please?

Jun 03 '25 13:06 Zethson

Ah sorry for overlooking this! Thanks @Zethson for pinging me!

Honestly, I have never calculated the variance with the Wasserstein distance, so I do not know what sensible values are for this case. That said, a variance above 100 sounds like a scale issue. Since you are calculating distances directly on gene expression, the values can get really large. I recommend scaling your data with z-score normalization prior to calculating distances.

For context: when e.g. calculating distances on PCA space, the data is scaled implicitly by PCA so we do not observe this issue.
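
To illustrate the scale effect, here is a rough sketch in plain NumPy on simulated data. It is not pertpy's bootstrap and it does not use the Wasserstein distance: mean_euclidean and bootstrap_variance are hypothetical helpers, and the distance between group means is used as a simple stand-in metric. The simulated per-gene scales are assumptions too. The only point is that the variance of a bootstrapped distance depends strongly on the scale of the features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression for two conditions, 100 cells x 100 genes each (hypothetical).
# Per-gene scales vary widely, as unscaled expression values often do.
gene_scale = rng.uniform(1.0, 30.0, size=100)
X = rng.normal(loc=0.0, scale=gene_scale, size=(100, 100))
Y = rng.normal(loc=0.5 * gene_scale, scale=gene_scale, size=(100, 100))

# Z-score each gene using the mean and std pooled over both groups.
pooled = np.vstack([X, Y])
mu, sd = pooled.mean(axis=0), pooled.std(axis=0)
Xz, Yz = (X - mu) / sd, (Y - mu) / sd


def mean_euclidean(a, b):
    """Euclidean distance between group means (stand-in for the real metric)."""
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))


def bootstrap_variance(a, b, n_boot=200, seed=1):
    """Resample cells with replacement in each group; return the variance
    of the distance across resamples."""
    boot_rng = np.random.default_rng(seed)
    dists = []
    for _ in range(n_boot):
        ia = boot_rng.integers(0, len(a), size=len(a))
        ib = boot_rng.integers(0, len(b), size=len(b))
        dists.append(mean_euclidean(a[ia], b[ib]))
    return float(np.var(dists))


var_raw = bootstrap_variance(X, Y)
var_scaled = bootstrap_variance(Xz, Yz)
print(f"bootstrap variance, raw: {var_raw:.3f}, z-scored: {var_scaled:.3f}")
```

In this toy setup the z-scored data gives a much smaller bootstrap variance than the raw data, simply because a few high-variance genes no longer dominate the distance. With AnnData, scanpy.pp.scale is one way to apply per-gene z-scoring before building the subset.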

Jun 04 '25 12:06 stefanpeidli