liger
Can't exclude factors from louvainCluster
Hi Liger team,
I bumped into an issue with louvainCluster() that I think should be a relatively quick and easy solve, but wanted to make sure I wasn't missing anything first.
So I have some data, and I noticed while looking at the UMAP projections that there's an odd kind of split/mirroring in some of my clusters (for ex. see here clusters 14, 21, and 4, but also 13, 3, and 10):
[image: UMAP projection colored by cluster]
A little digging shows that all of those splits are caused by the same factor, and basically all of the major cell populations show some separation based on that factor:
[image: factor loading plot for the suspect factor]
Given how suspiciously artifactual it looks, we'd like to exclude it for now. Shouldn't be a problem, I can just run the UMAP algorithm again without that factor and that solves the issue for the projection, but when I run louvainCluster(), I can't tell it to ignore the problematic factor, so it finds clusters that correspond to said factor and the result is a bit messy (see clusters 1/6, 7/11, and 4/0):
[image: UMAP projection colored by louvainCluster() result]
My first instinct was just to try and add a dims.use parameter to louvainCluster() a la runUMAP(). I did this in my own fork (can also add as a PR), and it seemed to clean things up alright, but to be frank I'm not sure if I'm missing some more sophisticated reason why this is a bad idea.
Within louvainCluster():
knn <- RANN::nn2(object@H.norm, k = k, eps = eps)
just becomes
knn <- RANN::nn2(object@H.norm[, dims.use], k = k, eps = eps)
and we get a much nicer picture:
Thanks so much, Karl
Hi Karl,
Not a member of the dev team, but have you tried the solution shown in Liu et al., 2020 (Nature Protocols) (pg 3645)? https://www.nature.com/articles/s41596-020-0391-8
To remove these factors from further analysis, we again run quantile_norm, with the dims.use parameter equal to the set difference of the list of all factors and technical artifacts. To perform quantile normalization and UMAP, type in:
i_and_o <- quantile_norm(i_and_o, dims.use = setdiff(1:40, c(21,23)))
Best, Sam
Hi Karl,
Thank you for your valuable input. We have adopted your suggestion. If running quantile_norm with dims.use (thanks @samuel-marsh for bringing it up) still does not resolve the issue (note: no factor gets removed from H.norm), you can now exclude unwanted factors in louvainCluster via the dims.use parameter.
Best, Chao
Hey all,
I did try running quantile_norm with dims.use set, but long story short, it doesn't appear to have much of an effect. Excluding the factor does slightly alter the H.norm result and subsequent UMAP embedding, but unless I explicitly exclude it in both runUMAP and louvainCluster, that problematic factor is still distinguishable.
Just to illustrate this point really quick, I can run
noRG.liger <- quantile_norm(noRG.liger)
or
noRG.liger <- quantile_norm(noRG.liger, dims.use = 1:21)
before calling
noRG.liger <- runUMAP(noRG.liger, distance = 'cosine', n_neighbors = 30, min_dist = 0.32)
noRG.liger <- louvainCluster(noRG.liger, resolution = 0.25, k = 30)
on each result and the graphs aren't really meaningfully different:
(quantile_norm nothing excluded)
(quantile_norm one factor excluded)
(See for example the shadows of mitotic cells in that loop, etc. I promise the factor loading plots for both look basically identical to the one up in the OP.)
So, it's looking like I really do need to explicitly exclude that factor in the UMAP and louvain calls (thanks @cgao90 for including that!). However, it did sort of get me thinking: now that there are three distinct places to do so, in what situations does it make sense to exclude factors at each of the stages? Like should I, as a rule, always use the same dims.use parameter in each function call? Or are there times where it would make sense to exclude a factor in quantile_norm but not in runUMAP or louvainCluster? Or vice versa etc. etc.? And why did the factor exclusion in the Liu et al. paper (thanks @samuel-marsh!) work for that dataset but not for mine?
To be honest I haven't really thought enough about the answers to any of those questions (so if anyone has any thoughts I'd be happy to hear them), but I do have at least a guess for that last one.
Right, so why did excluding factors from quantile_norm work for the Interneuron/Oligodendrocyte dataset in Liu et al. and not for mine? As far as I can tell, the first step of quantile_norm is to cluster the cells based on their maximum factor coefficients (up to some knn refinement). Excluding a factor from dims.use forces cells that would have been assigned to the excluded factor into their next highest cluster assignment. Then, since the quantile normalization is cluster specific, moving those cells into different clusters means we aren't as aggressively aligning cells across datasets that are only similar in their expression of the problematic factor.
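To make that reassignment argument concrete, here is a toy sketch in base R. It is not liger's actual implementation: the matrix H, its loadings, and the helper assign_max_factor are all made up for illustration, and the knn refinement step is skipped entirely.

```r
# Toy factor loadings: 4 cells (rows) x 3 factors (columns). Values are invented.
H <- matrix(c(0.9, 0.1, 0.8,
              0.2, 0.7, 0.9,
              0.6, 0.3, 0.1,
              0.1, 0.5, 0.4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("cell", 1:4), paste0("factor", 1:3)))

# Hypothetical helper (not part of liger): label each cell with its
# highest-loading factor, considering only the columns listed in dims.use.
assign_max_factor <- function(H, dims.use = seq_len(ncol(H))) {
  # max.col() returns a position within the subsetted matrix, so map it
  # back to the original factor index through dims.use.
  dims.use[max.col(H[, dims.use, drop = FALSE], ties.method = "first")]
}

all_factors <- assign_max_factor(H)                               # every factor allowed
no_factor3  <- assign_max_factor(H, dims.use = setdiff(1:3, 3))   # factor 3 excluded

print(data.frame(cell = rownames(H), all_factors, no_factor3))
```

In this toy case only cell2 changes label: its top factor was the excluded factor 3, so it falls back to factor 2, its next highest. The other cells keep their assignments, which is why I'd expect the exclusion to affect only the cells dominated by the problematic factor.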
So in Liu et al. it makes sense to exclude factors 21 and 23 from quantile_norm because the algorithm was too strongly aligning cells that were only similar based on their expression of mitochondrial transcripts. However, in my dataset, the problematic factor is also causing cells that should be a single population to separate (I suspect it's a cell state or stress response type of thing). It may be the case that there is some spurious alignment between cells that highly express that factor (and I will likely exclude the factor from quantile_norm just to be safe), but since it's also manifesting as a problem in cell clusters that don't correspond directly to that factor, I also have to ignore it later in louvainCluster if I don't want it to affect how I'm calling cells.
Anyway thanks so much for y'alls help. Karl