Is it appropriate/possible to run IntegrateData following GLM-PCA?

Open jeremymsimon opened this issue 2 years ago • 1 comments

Hi @willtownes, My typical Seurat workflow for multiple samples and conditions is to run PCA followed by CCA-based integration (now IntegrateLayers in Seurat v5), then identify one joint set of clusters. If I want to try swapping in GLM-PCA, is that supposed to work as-is or do I need to adjust somehow?

I just ran a test on real data using the approximation method mentioned here: I ran nullResiduals on my raw counts, then ran PCA on those residuals using the top 3k deviant genes, followed by IntegrateLayers.
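For readers unfamiliar with the "top deviant genes" step, here is a minimal sketch of ranking genes by deviance under a null model in which each gene takes a constant fraction of every cell's total counts. It is an illustration only, not scry's implementation: it assumes a Poisson null (scry also offers a binomial one), and the function names and the toy `counts` matrix are hypothetical.

```python
import math

def poisson_deviance_per_gene(counts):
    """Per-gene Poisson deviance against a constant-fraction null model.

    counts: list of cells, each a list of raw counts (one entry per gene).
    Assumes at least one nonzero count overall.
    """
    n_genes = len(counts[0])
    cell_totals = [sum(cell) for cell in counts]
    grand_total = sum(cell_totals)
    deviances = []
    for j in range(n_genes):
        # Null model: gene j takes the same fraction pi_j of every cell's total.
        pi_j = sum(cell[j] for cell in counts) / grand_total
        dev = 0.0
        for cell, n_i in zip(counts, cell_totals):
            y = cell[j]
            mu = n_i * pi_j
            if y > 0:  # y * log(y / mu) is taken as 0 when y == 0
                dev += y * math.log(y / mu)
            dev -= y - mu
        deviances.append(2.0 * dev)
    return deviances

def top_deviant_genes(counts, k):
    """Indices of the k genes with the largest deviance."""
    dev = poisson_deviance_per_gene(counts)
    return sorted(range(len(dev)), key=lambda j: dev[j], reverse=True)[:k]

# Toy data (hypothetical): 4 cells x 3 genes; gene 1 is on in only two cells.
counts = [
    [10, 0, 5],
    [10, 0, 5],
    [10, 20, 5],
    [10, 20, 5],
]
deviances = poisson_deviance_per_gene(counts)
selected = top_deviant_genes(counts, 1)
```

In this toy matrix the gene that is switched on in only half of the cells gets the largest deviance, which is the behaviour the feature-selection step relies on.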

The resulting UMAP and clusters looked nothing like my PCA-based analysis, so either I did something wrong or it is not appropriate to do this in the first place.

Can you share your thoughts on whether this is possible, and if so, some best practices for doing this in Seurat when the dataset is large (>50k cells)? The RunGLMPCA() helper function was itself taking too long on these data.

Thanks!

Aug 04 '23 20:08 jeremymsimon

Hi, it's a good question and I honestly don't know whether it's a good idea or not! When I wrote the GLM-PCA paper I didn't think about data integration at all, except for the possibility of adjusting for cell-specific batch with dummy variables. You are right that running PCA on deviance or Pearson residuals is the recommended approximation to GLM-PCA. We implemented it in scry, but there are other versions you could try, such as that of Jan Lause et al. I'm guessing scry is what they used in that paper. For large data you might try fastRNA, which seems to focus on scalability.
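To make the residual approximation concrete, here is a minimal sketch of Pearson residuals under a Poisson null model, where the expected count for gene j in cell i is the cell's total count times the gene's overall fraction of counts. This is an assumption-laden illustration rather than the code in scry or any other package; the function name and toy matrices are hypothetical, and ordinary PCA would then be run on the resulting residual matrix.

```python
import math

def poisson_null_pearson_residuals(counts):
    """Pearson residuals (y - mu) / sqrt(mu) under a Poisson null model.

    counts: list of cells, each a list of raw counts (one entry per gene).
    mu for cell i, gene j is (total counts in cell i) * (gene j's share of all counts).
    """
    n_genes = len(counts[0])
    cell_totals = [sum(cell) for cell in counts]
    grand_total = sum(cell_totals)
    gene_frac = [sum(cell[j] for cell in counts) / grand_total
                 for j in range(n_genes)]
    residuals = []
    for cell, n_i in zip(counts, cell_totals):
        row = []
        for y, pi_j in zip(cell, gene_frac):
            mu = n_i * pi_j
            # Genes with no counts anywhere have mu == 0; report 0 for them.
            row.append((y - mu) / math.sqrt(mu) if mu > 0 else 0.0)
        residuals.append(row)
    return residuals

# Toy data (hypothetical). In `proportional`, every gene scales with cell depth,
# so the null model fits exactly; in `swapped`, the two cells differ in composition.
proportional = [[2, 8], [4, 16]]
swapped = [[0, 10], [10, 0]]
res_proportional = poisson_null_pearson_residuals(proportional)
res_swapped = poisson_null_pearson_residuals(swapped)
```

Cells that differ only in sequencing depth produce residuals of zero, while differences in composition show up as large positive or negative values, so the PCA that follows sees variation beyond what depth alone explains.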

Aug 14 '23 21:08 willtownes