Can't exclude factors from louvainCluster

Open ksyoungnm opened this issue 3 years ago • 3 comments

Hi Liger team,

I bumped into an issue with louvainCluster() that I think should be a relatively quick and easy fix, but I wanted to make sure I wasn't missing anything first.

So I have some data, and I noticed while looking at the UMAP projections that there's an odd kind of split/mirroring in some of my clusters (for ex. see here clusters 14, 21, and 4, but also 13, 3, and 10):

A little digging shows that all of those splits are caused by the same factor, and basically all of the major cell populations show some separation based on that factor:

Given how suspiciously artifactual it looks, we'd like to exclude it for now. That shouldn't be a problem for the projection: I can just run the UMAP algorithm again without that factor. But when I run louvainCluster(), I can't tell it to ignore the problematic factor, so it finds clusters that correspond to said factor and the result is a bit messy (see clusters 1/6, 7/11, and 4/0):

My first instinct was just to try and add a dims.use parameter to louvainCluster() a la runUMAP(). I did this in my own fork (can also add as a PR), and it seemed to clean things up alright, but to be frank I'm not sure if I'm missing some more sophisticated reason why this is a bad idea.

within louvainCluster():

knn <- RANN::nn2(object@H.norm, k = k, eps = eps)

just becomes

knn <- RANN::nn2(object@H.norm[, dims.use], k = k, eps = eps)

and we get a much nicer picture:
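To show why that column subset matters, here is a minimal base R sketch on made-up data. It is not liger's code: the toy matrix H, the brute-force knn_idx helper (standing in for RANN::nn2), and the same_state summary are all hypothetical names used only for illustration.

```r
# Toy stand-in for the normalized factor matrix (cells x factors).
# All values are made up for illustration.
set.seed(1)
n <- 40
H <- cbind(
  f1 = c(rep(1, n / 2), rep(0, n / 2)),  # separates two real populations
  f2 = c(rep(0, n / 2), rep(1, n / 2)),
  f3 = rep(c(0, 1), times = n / 2)       # nuisance factor, splits both populations
) + matrix(rnorm(n * 3, sd = 0.05), n, 3)

# Brute-force k nearest neighbours (base R only; the real call uses RANN::nn2).
knn_idx <- function(m, k) {
  d <- as.matrix(dist(m))
  diag(d) <- Inf                         # a cell is not its own neighbour
  t(apply(d, 1, function(row) order(row)[seq_len(k)]))
}

# Fraction of (cell, neighbour) pairs that share the nuisance state.
same_state <- function(idx, state) mean(state[idx] == state[row(idx)])

state <- rep(c(0, 1), times = n / 2)
dims.use <- c(1, 2)                      # drop factor 3

frac_all  <- same_state(knn_idx(H, k = 5), state)
frac_kept <- same_state(knn_idx(H[, dims.use], k = 5), state)
```

With all three columns, neighbours are drawn only from cells in the same nuisance state, so a graph-based clustering will tend to split each population in two. Once the column is dropped, the neighbourhoods mix the two states again.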

Thanks so much, Karl

ksyoungnm avatar Aug 26 '21 23:08 ksyoungnm

Hi Karl,

I'm not a member of the dev team, but have you tried the solution shown in Liu et al., 2020 (Nature Protocols), p. 3645? https://www.nature.com/articles/s41596-020-0391-8

To remove these factors from further analysis, we again run quantile_norm, with the dims.use parameter equal to the set difference of the list of all factors and technical artifacts. To perform quantile normalization and UMAP, type in:

i_and_o <- quantile_norm(i_and_o, dims.use = setdiff(1:40, c(21,23)))
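For reference, the dims.use argument in that call is just a vector of the factor indices to keep. A small base R sketch follows; the variable names n_factors, artifacts, and dims.keep are made up here, while the numbers come from the protocol example above.

```r
n_factors <- 40          # total number of factors in the protocol example
artifacts <- c(21, 23)   # factors flagged there as technical artifacts

# Every factor index except the flagged ones.
dims.keep <- setdiff(seq_len(n_factors), artifacts)
length(dims.keep)
```

Storing the vector once makes it easy to pass the same set of factors to any later call that accepts dims.use.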

Best, Sam

samuel-marsh avatar Sep 01 '21 17:09 samuel-marsh

Hi Karl,

Thank you for your valuable input. We have adopted your suggestion. If quantile_norm with dims.use (thanks @samuel-marsh for bringing it up) still does not resolve the issue (note: no factor gets removed from H.norm), you can now exclude unwanted factors in louvainCluster via the dims.use parameter.

Best, Chao

cgao90 avatar Sep 02 '21 03:09 cgao90

Hey all,

I did try running quantile_norm with dims.use set, but long story short, it doesn't appear to have much of an effect. Excluding the factor does slightly alter the H.norm result and subsequent UMAP embedding, but unless I explicitly exclude it in both runUMAP and louvainCluster that problematic factor is still distinguishable.

Just to illustrate this point really quick, I can run

noRG.liger <- quantile_norm(noRG.liger)

or

noRG.liger <- quantile_norm(noRG.liger, dims.use = 1:21)

before calling

noRG.liger <- runUMAP(noRG.liger, distance = 'cosine', n_neighbors = 30, min_dist = 0.32)
noRG.liger <- louvainCluster(noRG.liger, resolution = 0.25, k = 30)

on each result, and the graphs aren't meaningfully different: (quantile_norm nothing excluded) (quantile_norm one factor excluded). See, for example, the shadows of mitotic cells in that loop. I promise the factor loading plots for both look basically identical to the one up in the OP.

So, it's looking like I really do need to explicitly exclude that factor in the UMAP and louvain calls (thanks @cgao90 for including that!). However, it did get me thinking: now that there are three distinct places to do so, in what situations does it make sense to exclude factors at each stage? Should I, as a rule, always use the same dims.use parameter in each function call? Or are there times when it would make sense to exclude a factor in quantile_norm but not in runUMAP or louvainCluster, or vice versa? And why did the factor exclusion in the Liu et al. paper (thanks @samuel-marsh!) work for that dataset but not for mine?

To be honest I haven't really thought enough about the answers to any of those questions (so if anyone has any thoughts would be happy to hear them), but I do have at least a guess for that last one.

Right, so why did excluding factors from quantile_norm work for the Interneuron/Oligodendrocyte dataset in Liu et al. and not for mine? As far as I can tell, the first step of quantile_norm is to cluster the cells based on their maximum factor coefficients (up to some knn refinement). Excluding a factor from dims.use forces cells that would have been assigned to the excluded factor into their next-highest cluster assignment. Then, since the quantile normalization is cluster-specific, moving those cells into different clusters means we aren't as aggressively aligning cells across datasets that are only similar in their expression of the problematic factor.
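The reassignment described above can be sketched in a few lines of base R. This is only an illustration of the max-factor step as I understand it, not quantile_norm itself: the loadings are made up, assign_max_factor is a hypothetical helper, and the knn refinement is left out entirely.

```r
# Made-up loadings: 4 cells x 3 factors.
H <- rbind(
  cell1 = c(0.9, 0.1, 0.2),
  cell2 = c(0.2, 0.3, 0.8),
  cell3 = c(0.4, 0.1, 0.7),
  cell4 = c(0.1, 0.6, 0.5)
)
colnames(H) <- paste0("factor", 1:3)

# Assign each cell to its highest-loading factor among dims.use.
# (Hypothetical helper; the knn refinement step is not reproduced.)
assign_max_factor <- function(H, dims.use = seq_len(ncol(H))) {
  dims.use[apply(H[, dims.use, drop = FALSE], 1, which.max)]
}

all_factors <- assign_max_factor(H)                     # uses factors 1 to 3
without_f3  <- assign_max_factor(H, dims.use = c(1, 2)) # factor 3 excluded
```

Here cell2 and cell3 are assigned to factor 3 when it is available, and fall back to factors 2 and 1 respectively once it is excluded, so they would be normalized together with different groups of cells.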

So in Liu et al. it makes sense to exclude factors 21 and 23 from quantile_norm, because the algorithm was too strongly aligning cells that were only similar based on their expression of mitochondrial transcripts. In my dataset, however, the problematic factor is also causing cells that should be a single population to separate (I suspect it's a cell state or stress response type of thing). It may be the case that there is some spurious alignment between cells that highly express that factor (and I will likely exclude the factor from quantile_norm just to be safe), but since it's also manifesting as a problem in cell clusters that don't correspond directly to that factor, I also have to ignore it later in louvainCluster if I don't want it to affect how I'm calling cells.

Anyway, thanks so much for y'all's help. Karl

ksyoungnm avatar Sep 04 '21 00:09 ksyoungnm