Expectations for UMAP of embeddings

Open Thapeachydude opened this issue 1 year ago • 2 comments

Hi,

interesting tool (and the use of the SingleCellExperiment class is much appreciated). We have single-cell RNA-Seq data from multiple samples under various conditions. Thus far, I've seen the most sensible results by creating pseudo-bulks per cell type, sample and condition and fitting a linear model ~ sample + Condition using edgeR.

So, I was very curious when I saw your tool. Thus far, however, I'm not sure I understand the output. I ran:

fit <- lemur(sce, design = ~ sampleID + Treatment, n_embedding = 20)

set.seed(100)
fit <- runUMAP(fit, dimred = "embedding", n_neighbors = 15, min_dist = 0.25, name = "UMAP_embedding", BPPARAM = mcparam)

to get an overview of the embeddings. I would expect at least some separation based on the conditions (since for some of them the pseudo-bulk results are quite strong, and we can even appreciate them in a UMAP of PCA loadings). But I see a big blob of cells, and some very small individual groups. But no "Treatment-shifts".

Is this what you would expect?

Btw., is there a way to limit the memory of the lemur function call? It is very fast but super memory-intensive. Ideally, I wouldn't always need to run it on an HPC.

Best, M

Thapeachydude, Dec 04 '23 16:12

Hi M,

thanks for your interest and for reaching out.

> I would expect at least some separation based on the conditions (since for some of them the pseudo-bulk results are quite strong, and we can even appreciate them in a UMAP of PCA loadings)

LEMUR tries to absorb as much of the variation associated with the known covariates as possible into $R(x)$ and $S(x)$, so the embedding ($Z$) shows you the residual variance (i.e., everything that varies for reasons other than sampleID or Treatment). This would typically be different cell states.

> But I see a big blob of cells, and some very small individual groups. But no "Treatment-shifts".

Depending on your data, this might be a reasonable outcome. If there is not much latent heterogeneity (is your data from a cell line, for example?), you would expect to see one big blob with cells from all conditions intermixed.
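To make the point about $Z$ only showing residual variation more concrete, here is a toy sketch in base R. It is not LEMUR's algorithm: it swaps in plain linear regression followed by PCA as a much simpler analogy, and the data, the `treatment` factor, and the `pc1_gap` helper are all made up for illustration. It shows that once a known covariate has been regressed out, a low-dimensional embedding of the residuals no longer separates the conditions, even when the raw data clearly do.

```r
# Toy illustration only: plain linear regression + PCA, NOT LEMUR's algorithm.
set.seed(1)
n_cells <- 200
n_genes <- 50

# Hypothetical covariate: half the cells are treated.
treatment <- factor(rep(c("ctrl", "treated"), each = n_cells / 2))

# Simulated expression (cells x genes): noise plus a treatment shift in the first 10 genes.
expr <- matrix(rnorm(n_cells * n_genes), nrow = n_cells, ncol = n_genes)
expr[treatment == "treated", 1:10] <- expr[treatment == "treated", 1:10] + 3

# PCA on the raw data: PC1 picks up the treatment shift.
pca_raw <- prcomp(expr, center = TRUE, scale. = FALSE)

# Regress the known covariate out of every gene, then run PCA on the residuals.
resid_expr <- residuals(lm(expr ~ treatment))
pca_resid <- prcomp(resid_expr, center = TRUE, scale. = FALSE)

# Hypothetical helper: absolute difference in mean PC1 score between the two conditions.
pc1_gap <- function(pca) {
  unname(abs(diff(tapply(pca$x[, 1], treatment, mean))))
}
gap_raw <- pc1_gap(pca_raw)      # large: conditions separate along PC1
gap_resid <- pc1_gap(pca_resid)  # numerically zero: conditions are intermixed
```

In this analogy, the residual PCA plays the role of the embedding: the treatment effect has not disappeared, it has been moved into the fitted part of the model, so it has to be looked for there rather than in a UMAP of the embedding.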

Best, Constantin

const-ae, Dec 04 '23 16:12

> Btw., is there a way to limit the memory of the lemur function call? It is very fast but super memory-intensive. Ideally, I wouldn't always need to run it on an HPC.

There currently is no easy way to limit the memory requirements beyond subsampling your cells and subsetting to a reasonable set of highly variable genes.
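A minimal sketch of that workaround in base R might look like the following. The `pick_subset` helper, its defaults, and the variance-based gene ranking are assumptions for illustration and not part of lemur; ranking by raw variance is only a crude stand-in for a proper highly-variable-gene method, and `apply()` would densify a sparse matrix, so a sparse-aware variance calculation would be preferable on large data.

```r
# Sketch: choose a random subset of cells and the most variable genes.
# `pick_subset` is a hypothetical helper, not part of lemur.
pick_subset <- function(logexpr, n_cells = 5000, n_genes = 2000, seed = 1) {
  set.seed(seed)
  n_cells <- min(n_cells, ncol(logexpr))
  n_genes <- min(n_genes, nrow(logexpr))
  # Random sample of cell (column) indices.
  cells <- sort(sample.int(ncol(logexpr), n_cells))
  # Rank genes (rows) by variance across the sampled cells.
  gene_var <- apply(logexpr[, cells, drop = FALSE], 1, var)
  genes <- sort(order(gene_var, decreasing = TRUE)[seq_len(n_genes)])
  list(genes = genes, cells = cells)
}

# Small simulated genes x cells matrix to show the helper running.
set.seed(42)
toy <- matrix(rnorm(100 * 300), nrow = 100, ncol = 300)
toy[1:5, ] <- toy[1:5, ] * 5  # make the first five genes clearly more variable
idx <- pick_subset(toy, n_cells = 50, n_genes = 5)
toy_small <- toy[idx$genes, idx$cells]

# With a SingleCellExperiment, the same indices could be applied as
#   sce_small <- sce[idx$genes, idx$cells]
# before calling lemur() as in the original post.
```

Sampling cells uniformly at random may under-represent small samples or rare conditions, so sampling a fixed number of cells per sampleID could be a safer variant.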

const-ae, Dec 04 '23 16:12