Dealing with batch effects
Hello,
I would like to try MOFA+ on some CITE-seq data and was wondering if you could provide some recommendations on how to deal with batch effects. We're generally working with data derived from 2-3 different experiments, between which we observe minor batch effects for the RNA component and quite significant batch effects for the surface protein component.
I've found that fastMNN gives good batch-corrected PCA-like embeddings, but the reconstructed counts should be used with caution, as they can have negative values. I saw that your FAQ mentions limma, but if I recall correctly, benchmarks generally show it to be inferior at batch correction when it comes to single-cell data.
Any insights or recommendations would be much appreciated, thanks!
One thing to try could be training a MOFA+ model and checking whether the first few factors correspond to the batch effect, and whether any of the other factors correlate with the experiment covariate.
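A minimal sketch of that check, assuming the factor values have already been extracted from the trained model as a cells x factors array. The function name `variance_explained_by_batch` and the toy data are hypothetical and not part of MOFA+. For each factor it computes the fraction of variance explained by the experiment label (one-way ANOVA eta-squared); values close to 1 would suggest a "batch factor".

```python
import numpy as np

def variance_explained_by_batch(factors, batch):
    """Fraction of each factor's variance explained by batch (eta-squared)."""
    factors = np.asarray(factors, dtype=float)
    batch = np.asarray(batch)
    grand_mean = factors.mean(axis=0)
    # Total sum of squares per factor
    ss_total = ((factors - grand_mean) ** 2).sum(axis=0)
    # Between-batch sum of squares per factor
    ss_between = np.zeros(factors.shape[1])
    for b in np.unique(batch):
        group = factors[batch == b]
        ss_between += group.shape[0] * (group.mean(axis=0) - grand_mean) ** 2
    return ss_between / ss_total

# Toy data standing in for MOFA+ factors (assumption: 600 cells, 3 factors)
rng = np.random.default_rng(0)
batch = np.repeat(["exp1", "exp2", "exp3"], 200)
factors = rng.normal(size=(600, 3))
factors[:, 0] += np.where(batch == "exp2", 3.0, 0.0)  # first factor carries a batch shift

eta2 = variance_explained_by_batch(factors, batch)
print(np.round(eta2, 3))
```

In this toy example only the first factor should show a high value; with real data, the same array of per-factor values can be inspected alongside plots of the factors coloured by experiment.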
It should also be possible to normalize protein counts with CLR/dsb and then use a batch-correction method of choice that outputs corrected counts. Since the number of surface proteins is usually comparable to the number of components one typically uses, I can imagine you could even replace the count matrix with the batch-corrected embeddings, but this will complicate downstream interpretation.
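For illustration, here is a rough sketch of a CLR-style transform on a cells x proteins count matrix, written in plain NumPy. It is a simple log1p-based variant centred within each cell, not the exact implementation of any particular toolkit, and it does not cover dsb, which additionally relies on empty droplets to estimate background. The function name `clr_per_cell` and the toy counts are hypothetical.

```python
import numpy as np

def clr_per_cell(counts):
    """CLR-style transform: log1p counts, centred on each cell's mean log value."""
    counts = np.asarray(counts, dtype=float)
    log_counts = np.log1p(counts)  # log1p acts as a pseudocount so zeros are defined
    return log_counts - log_counts.mean(axis=1, keepdims=True)

# Toy ADT counts (assumption: 3 cells x 4 surface proteins)
adt = np.array([
    [10, 0, 250, 3],
    [5, 2, 90, 1],
    [0, 0, 40, 12],
])

adt_clr = clr_per_cell(adt)
print(np.round(adt_clr, 2))
```

Implementations differ in the pseudocount and in whether centring is done per cell or per protein, so it is worth checking which convention the chosen batch-correction method expects as input.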
Alright, I'll try looking for a "batch factor" or using the corrected counts. Thanks!