Question about dimension reduction for single-cell and other data
As outlined in "A field guide for the compositional analysis of any-omics data", CoDa should be the way to do single-cell RNA-seq.
I'd like to cluster cells (into cell types) using CoDa. The first step is a dimensionality reduction (DR), followed by clustering.
What would be the best way to do the DR? I see two ways of doing this:
- Apply CLR (or some variant e.g. iqLR) and perform PCA.
- Use propr's phi as the association strength between cells (usually propr is used for the association between genes), then do a t-SNE.
As far as I understood it, the 'most common' way it is done in scSeq would be:
- Do some normalisation (there are many different ways)
- Take the leading principal components from a PCA
- Build a NN-graph and apply t-SNE/UMAP (and do some graph-based clustering)
I tried the CLR-PCA and found that the dominating first PC is strongly correlated with the number of counts. I hoped that the CoDa-based method would remove (at least part of) this bias.
Hey SilasK, thanks for your interest in propr!
Regarding dimensional reduction in CoDa, I would tend to run a CLR (or some variant) on the sample rows, then perform a PCA. You can use phi as a kind of distance measure, but typically phi is used to describe distances between features rather than distances between samples. When calculating phi on the transpose, the features (not the samples) would get CLR-transformed which doesn't make much sense to me.
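To make the CLR-then-PCA suggestion concrete, here is a minimal NumPy sketch. It assumes a cells x genes count matrix and handles zeros with a simple pseudocount, which is only one of several zero-replacement strategies; the function names `clr` and `pca_scores` and the toy data are made up for illustration and are not part of propr.

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform applied to each row (one row per cell)."""
    # Assumption: zeros are handled by adding a pseudocount before the log.
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Subtracting the row mean of the logs equals dividing by the geometric mean.
    return logged - logged.mean(axis=1, keepdims=True)

def pca_scores(matrix, n_components=2):
    """Project rows onto the leading principal components using an SVD."""
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
counts = rng.poisson(lam=5.0, size=(50, 200))  # toy data: 50 cells x 200 genes
transformed = clr(counts)
scores = pca_scores(transformed, n_components=2)
```

The rows of `scores` could then be passed to whatever neighbour-graph, t-SNE/UMAP, or clustering step you already use.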
I think the scSeq workflow is pretty much the same, except that the normalization step is replaced with a CLR. In fact, CLR can be thought of as a kind of normalization that isn't too different from effective library size normalization. We tried to make this a bit clearer in the section "The Quest for a Common Scale". I have not thought much about using t-SNE/UMAP for CLR-transformed data, but I have fit multi-layer perceptrons (NN) on CLR-transformed data, which works nicely.
Regarding "I tried the CLR-PCA and found that the dominating first PC is strongly correlated with the number of counts. I hoped that the CoDa based method would remove (at least part) of this bias." -- This is a very tricky one. I'll jot down some possible causes and solutions below.
I have seen this before when the number of zeros differs greatly between samples. During zero imputation, zeros are replaced with a very small number. The CLR requires a geometric mean of the sample. When a sample has more zeros, it gets imputed to have more small numbers, which pulls the whole geometric mean down. If this effect is large, the geometric mean "normalizing factor" begins to correlate with total counts (as do many genes). My guess is that the first PC reflects this process.
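A tiny worked example of that mechanism, as a sketch only: the two cells and the replacement value `delta=0.5` are hypothetical, chosen just to show how imputed zeros drag the CLR reference down.

```python
import numpy as np

def log_geometric_mean(counts, delta=0.5):
    """Log geometric mean of one sample after replacing zeros with a small value."""
    counts = np.asarray(counts, dtype=float)
    imputed = np.where(counts == 0, delta, counts)  # assumed zero-replacement rule
    return np.log(imputed).mean()

# Two hypothetical cells that agree on the genes they both detect;
# the second cell has fewer total counts and therefore more zeros.
deep_cell = np.array([10, 10, 10, 10, 10, 10])
shallow_cell = np.array([10, 10, 10, 0, 0, 0])

deep_ref = log_geometric_mean(deep_cell)
shallow_ref = log_geometric_mean(shallow_cell)

# CLR value of the first gene (count 10 in both cells)
deep_clr = np.log(10.0) - deep_ref
shallow_clr = np.log(10.0) - shallow_ref
```

The cell with more zeros gets the smaller reference, so the same raw count of 10 ends up with a larger CLR value there. Across many genes, that shift lines up with total counts.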
What to do about it? Assuming the problem is in fact due to differences in the number of zeros between samples, the trick here is to even out the influence of zeros somehow. A few ideas:
(1) Use a different reference. Martino et al. propose the "robust CLR" which replaces the geometric mean CLR with a reference computed from the non-zero elements only. IIRC this is somewhat similar to what DESeq2 recommends.
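A rough sketch of idea (1), assuming the reference is simply the geometric mean of each sample's non-zero entries. Leaving the zero entries as NaN is my assumption for illustration; a downstream method would have to cope with those missing values. The name `robust_clr` is made up here and is not taken from any package.

```python
import numpy as np

def robust_clr(counts):
    """CLR-like transform whose reference uses only the non-zero entries of each row."""
    counts = np.asarray(counts, dtype=float)
    observed = counts > 0
    logged = np.full(counts.shape, np.nan)
    logged[observed] = np.log(counts[observed])
    # Row-wise mean of the logs over observed entries only (NaNs are skipped).
    reference = np.nanmean(logged, axis=1, keepdims=True)
    return logged - reference  # zero entries stay NaN (assumption)

toy_counts = np.array([
    [10, 10, 10, 10, 10, 10],
    [10, 10, 10, 0, 0, 0],
])
rclr_values = robust_clr(toy_counts)
```

In this toy case the detected genes get the same value in both cells, because the zeros no longer feed into the reference.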
(2) Rarefaction! I know this is a bit taboo, but if you down-sample all your data to have the same total sequencing depth, you remove the effect of sequence depth altogether (though you do not remove the effect of compositionality). If the sequencing depth is the cause of the differences in zeros, this should "even out" the number of zeros present in your data.
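Rarefaction can be sketched as drawing reads without replacement until every cell reaches the same depth. The version below uses NumPy's `Generator.multivariate_hypergeometric`; the function name `rarefy`, the default of using the smallest library size, and the toy data are assumptions for illustration.

```python
import numpy as np

def rarefy(counts, depth=None, seed=0):
    """Down-sample every row (cell) to the same total count, without replacement."""
    counts = np.asarray(counts, dtype=np.int64)
    totals = counts.sum(axis=1)
    if depth is None:
        depth = int(totals.min())  # assumption: use the shallowest cell as target
    if depth > totals.min():
        raise ValueError("depth exceeds the total count of at least one cell")
    rng = np.random.default_rng(seed)
    # Each row is an urn of reads; draw `depth` reads from it.
    return np.vstack([rng.multivariate_hypergeometric(row, depth) for row in counts])

# Toy data: 3 cells with very different sequencing depths, 30 genes each
toy_rng = np.random.default_rng(1)
raw = toy_rng.poisson(lam=[[2.0], [5.0], [20.0]], size=(3, 30))
rarefied = rarefy(raw)
```

In practice you would probably drop very shallow cells first, since the shallowest remaining cell sets the depth for everyone.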
(3) Use a method that does not depend on CLR or zero imputation. We propose data-driven amalgamation to learn useful lower-dimension representations of the data. In short, it sums parts (e.g., genes) into groups to form a smaller simplex that approximately represents the larger simplex (by minimizing a loss function). It has an R package, `amalgam`.
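To show only what "summing parts into groups" means, here is a sketch with a fixed, hand-picked grouping. The data-driven part (learning the grouping by minimizing a loss) is not shown, and this is not the `amalgam` package's API; `amalgamate` and the grouping below are hypothetical.

```python
import numpy as np

def amalgamate(counts, groups, n_groups):
    """Sum columns (parts) into groups given one group index per column."""
    counts = np.asarray(counts, dtype=float)
    n_parts = counts.shape[1]
    # Binary assignment matrix: parts x groups, one 1 per row.
    assignment = np.zeros((n_parts, n_groups))
    assignment[np.arange(n_parts), groups] = 1.0
    return counts @ assignment

parts = np.array([
    [4, 0, 1, 3, 0, 2],
    [0, 0, 5, 1, 1, 0],
])
grouping = np.array([0, 0, 1, 1, 2, 2])  # hypothetical: 6 genes into 3 groups
amalgams = amalgamate(parts, grouping, n_groups=3)
```

Each sample keeps its total, and a group is zero only if all of its member genes are zero, so the smaller composition tends to have far fewer zeros.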