possible to compute distances on a subset of genes?
Description of feature
Hi! Thank you for making this package available! I was wondering if it is possible to compute distances between groups of cells for a subset of genes (for example differentially expressed between 2 groups)? Thanks in advance.
Great suggestion! We usually calculate most distances in lower-dimensional spaces (such as PCA space), since distances become less meaningful in high dimensions. Depending on how large your set of genes is, you can either:
- (If there are many, partially redundant genes) Calculate PCA on the subset of genes, then use that subset PCA for calculating distances. You can use the `mask_var` argument in `scanpy.pp.pca` for this.
- (If there are few genes) Calculate distances directly on the subset. In this case I would just put your subset in `adata.obsm['X_subset'] = adata[:, gene_subset].X.copy()`, then specify `pt.tl.Distance(metric="euclidean", obsm_key="X_subset")`.
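To make the second option concrete, here is a minimal NumPy-only sketch of what "distance on a gene subset" amounts to. It deliberately does not call pertpy or scanpy: the toy matrix, `gene_names`, `gene_subset`, and the helper `mean_pairwise_euclidean` are all hypothetical, and the metric is a plain mean of pairwise Euclidean distances, not necessarily what any pertpy metric computes internally. The column selection mirrors what `adata[:, gene_subset].X` does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 60 cells x 5 genes (hypothetical data).
gene_names = np.array(["g0", "g1", "g2", "g3", "g4"])
X_full = rng.normal(size=(60, 5))
condition = np.array(["A"] * 30 + ["B"] * 30)

# Shift group B in two genes only, so only those genes separate the groups.
X_full[condition == "B", 1] += 3.0
X_full[condition == "B", 3] += 3.0

# Hypothetical gene subset, e.g. genes differentially expressed between A and B.
gene_subset = ["g1", "g3"]
cols = np.flatnonzero(np.isin(gene_names, gene_subset))
X_subset = X_full[:, cols]  # analogous to adata[:, gene_subset].X


def mean_pairwise_euclidean(X, Y):
    """Mean Euclidean distance over all cell pairs (x in X, y in Y)."""
    diff = X[:, None, :] - Y[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())


A = X_subset[condition == "A"]
B = X_subset[condition == "B"]
d_between = mean_pairwise_euclidean(A, B)  # distance between the two groups
d_within = mean_pairwise_euclidean(A, A)   # baseline: group A against itself
print(f"between A and B: {d_between:.2f}, within A: {d_within:.2f}")
```

With a real `AnnData` object, the same subset matrix would be stored under an `obsm` key and that key passed to the distance, as described in the second bullet.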
@Zethson since our distance function is already flexible enough to handle this case by specifying a different key in obsm, I think there is no need to implement this feature here directly. We could add a small example of this to the docs though, because this approach is quite useful for analysis.
Thank you for the comment! I'll try the second option!
Hi! Thank you for the suggestion above. I tried the second suggestion and it seems to work well. However, when I run the bootstrap option, I get very large variances (i.e. between 120 and 160) for some comparisons only. Would you say that such large variance values can occur?
What I ran:

    adata.obsm['X_subset'] = adata[:, geneset].X.copy()
    distance = pt.tl.Distance(metric="wasserstein", obsm_key="X_subset")
    X = adata.obsm["X_subset"][adata.obs["condition"] == "A"]
    Y = adata.obsm["X_subset"][adata.obs["condition"] == "B"]
    D = distance.bootstrap(X,Y)

- my gene subsets are ~ 100 genes (DEGs).
Thank you!
@stefanpeidli do you have a comment, please?
Ah sorry for overlooking this! Thanks @Zethson for pinging me!
Honestly, I have never calculated the variance with the Wasserstein distance, so I do not know what sensible values are in this case. That said, a variance above 100 sounds like a scale issue. Since you are calculating distances directly on gene expression, the values can get really large. I recommend scaling your data with z-score normalization prior to calculating distances.
For context: when e.g. calculating distances on PCA space, the data is scaled implicitly by PCA so we do not observe this issue.
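The scale argument can be illustrated with a small NumPy-only sketch. Everything here is an assumption made for illustration: the toy data, the helper names `zscore`, `centroid_distance`, and `bootstrap_mean_var`, and the use of a simple centroid distance as a stand-in for Wasserstein. It is not pertpy's bootstrap implementation; it only shows how the bootstrap variance of a distance depends on the scale of the input features. In a scanpy workflow, the per-gene z-scoring would typically be done with `scanpy.pp.scale` before copying the subset into `obsm`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 1000 cells x 3 genes whose scales differ by 100x.
scales = np.array([1.0, 10.0, 100.0])
X = rng.normal(size=(1000, 3)) * scales
is_B = np.arange(1000) >= 500
X[is_B] += 0.5 * scales  # shift group B by half a unit of each gene's scale


def zscore(M, eps=1e-12):
    """Per-gene z-score: zero mean and unit standard deviation per column."""
    return (M - M.mean(axis=0)) / (M.std(axis=0) + eps)


def centroid_distance(A, B):
    """Euclidean distance between group means (stand-in metric, not Wasserstein)."""
    return float(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))


def bootstrap_mean_var(A, B, n_boot=200, seed=0):
    """Resample cells with replacement; return mean and variance of the distance."""
    boot_rng = np.random.default_rng(seed)
    dists = np.empty(n_boot)
    for i in range(n_boot):
        a = A[boot_rng.integers(0, len(A), size=len(A))]
        b = B[boot_rng.integers(0, len(B), size=len(B))]
        dists[i] = centroid_distance(a, b)
    return float(dists.mean()), float(dists.var())


Z = zscore(X)  # scale all cells jointly, gene by gene
mean_raw, var_raw = bootstrap_mean_var(X[~is_B], X[is_B])
mean_z, var_z = bootstrap_mean_var(Z[~is_B], Z[is_B])
print(f"raw:     mean={mean_raw:.2f}, var={var_raw:.2f}")
print(f"z-score: mean={mean_z:.2f}, var={var_z:.4f}")
```

In this toy setup the raw distance and its variance are dominated by the gene with the largest scale, while after z-scoring every gene contributes comparably and the variance drops by orders of magnitude.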