Reproducibility issues in Seurat.
Dear Seurat team,
Recently, our team has encountered significant reproducibility challenges while using Seurat. When applying the same code to identical datasets on different devices, we observe substantial variations in the results.
Upon investigation, we have identified that the root cause lies in the Principal Component Analysis (PCA) output. Specifically:
- PC Composition Differences: When analyzing the 30 principal components (PCs) used for subsequent clustering, we notice discrepancies in the composition of each PC. The genes associated with a particular PC differ between runs, leading to inconsistent results.
- PC Score Inversions: The most common issue is sign inversions in PC scores. Genes that exhibit positive scores on one computer may have negative scores on another, and vice versa. These variations in PC scores propagate through the entire analysis pipeline, significantly impacting the final results (a sketch of how we compare loadings between runs follows below).
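For illustration, this is how such a comparison can be made on the loadings (a minimal sketch; run_a and run_b are illustrative names for the same object computed on two devices, not files from the attached zip):

```r
library(Seurat)

# Gene x PC loading matrices from the "pca" reduction of each run.
load_a <- Loadings(run_a, reduction = "pca")
load_b <- Loadings(run_b, reduction = "pca")

# Per-PC correlation of loadings across runs: ~ +1 means the PC is
# reproduced, ~ -1 means the same PC with an inverted sign, and
# anything in between means the composition itself differs.
shared <- intersect(rownames(load_a), rownames(load_b))
sapply(1:30, function(pc) cor(load_a[shared, pc], load_b[shared, pc]))
```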
Furthermore, even when re-running the analysis on the same computer, we encounter run-to-run variability, indicating that the issue is not limited to cross-device comparisons.
Example Scenario:
- Clustering: We observe differences in the clustering results.
- Differential Expression Analyses: Variability affects comparisons both between clusters and within different cell types within a single cluster.
Environment Details:
- We are using the latest version of R, and all relevant packages are up-to-date.
- Our analysis was tested on three different computers: two running Linux and one on Windows.
- Attempts to enforce reproducibility using set.seed() and related functions did not yield consistent results (see the sketch below for the kind of seeding we tried).
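For reference, the seeding we attempted looks roughly like this (a minimal sketch; the exact calls are in the attached notebooks):

```r
# Pin the RNG algorithms explicitly, not just the seed, since
# RNGkind() settings can differ between R installations.
RNGkind(kind = "Mersenne-Twister", normal.kind = "Inversion",
        sample.kind = "Rejection")
set.seed(42)

# Seurat's stochastic steps also accept their own seeds:
# obj <- RunUMAP(obj, dims = 1:30, seed.use = 42)
# obj <- FindClusters(obj, random.seed = 0)
```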
Provided Examples:
- HTML Notebooks: We have attached HTML notebooks showcasing our analysis on two devices:
- Device A: Ubuntu 22.04.
- Device B: Windows 11.
- panc8 Test Data: We also reproduced the issue using the panc8 test data, and the corresponding script is included (its general shape is sketched after this list).
Seurat_reproducibility_issues.zip
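For anyone who wants to try this without our data, the panc8 script has roughly the following shape (a minimal sketch; the exact parameters are in the zip above):

```r
library(Seurat)
library(SeuratData)  # provides the panc8 dataset

InstallData("panc8")
panc8 <- LoadData("panc8")

# Same pipeline as with our own data: SCTransform, then PCA with
# 30 PCs, clustering, and UMAP.
panc8 <- SCTransform(panc8)
panc8 <- RunPCA(panc8, npcs = 30)
panc8 <- FindNeighbors(panc8, dims = 1:30)
panc8 <- FindClusters(panc8)
panc8 <- RunUMAP(panc8, dims = 1:30)
```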
We kindly request your expertise in addressing this challenge. Reliable results are crucial for our ongoing analysis, and we appreciate any insights or solutions you can provide.
Thank you for your attention, and we look forward to your guidance.
Best regards,
JM.
I am having the same issue: setting a seed and rerunning the same code on the same computer, with a fixed R environment, can still produce different UMAPs.
I have observed this issue as well. Running the same script on different PCs with the same package versions results in nearly inverted UMAPs, or slight alterations to them, which would be consistent with the PCA issue.
It is also mentioned in #8342, which is very similar to what I have observed.
Hi @jmgs7, thanks for providing the scripts. Can you rerun them, outputting your sessionInfo() at the end?
Hi @jmgs7, can you additionally output:
sessionInfo()
RNGkind()
Thanks!
Dear @saketkc,
Thank you very much for your reply and sorry for the delay.
I attach the same folder with the updated scripts. I labeled the old HTML notebooks as _1 and the new ones as _2 (the latter contain the information you requested) so that run-to-run variability can be checked, and I did in fact find an instance: in the test with our data run on Device B (Run_with_our_data_DeviceB), you can see differences in the volcano plot at the end of the document.
Seurat_reproducibility_issues.zip
We really appreciate your attention, and we hope to hear about this issue soon.
Best regards,
JM.
Hi @jmgs7, thanks for sharing the sessionInfo. Can you install the latest Seurat (v5.0.2) and share the files again? I have not been able to reproduce this on my end yet.
Sure, there you have it. I added the new files with the suffix _v5.0.2.
Seurat_reproducibility_issues.zip
Thank you very much for your time.
Dear @saketkc,
Have you been able to reproduce the issues I encountered?
Thank you.
Unfortunately no. Can you confirm whether you notice the same issue without vars.to.regress?
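That is, running both variants and comparing the outputs (a sketch; percent.mt is just a placeholder for whichever variable you are regressing out):

```r
# Current run, with regression:
obj_reg <- SCTransform(obj, vars.to.regress = "percent.mt")

# Same call without regression, everything else identical:
obj_noreg <- SCTransform(obj)
```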
Hi @jmgs7, would you be able to share the objects with me after you run SCTransform? You can save them just after the SCTransform step.
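For example (the filename is just a placeholder):

```r
obj <- SCTransform(obj, vars.to.regress = "percent.mt")
saveRDS(obj, file = "post_sctransform_deviceA.rds")  # share this file
```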
Dear @saketkc,
Sorry for the delay in my answer, but I have been away on holiday.
Sure, I will send you the .rds objects for the test I am running, from both devices.
Because the files are too big, I will send a download link to your institutional email at the New York Genome Center.
Thank you again for your time,
Best regards.
Hi all,
I'm also having consistency issues with my scripts. The scripts haven't changed since I last ran them back in January, but are now producing different UMAPs. If any more information on this issue comes to light, please keep us updated.
Thank you!
Dear @saketkc,
Have you received our email?
Please confirm that you can access our data, so that we can resend it if you have encountered any problem.
Best regards,
JM.
Hello everyone,
I'm encountering the same issue while using an identical virtual environment managed with Conda. Specifically, I'm using Seurat v5. I've executed identical code on two separate workstations: one running Ubuntu 22.04.3 LTS and the other Ubuntu 18.04.5 LTS. Like others who have reported similar problems, the issue appears to lie within the RunPCA function; notably, I'm observing divergent values for the first two principal components (PCs). I'm using published data from GEO (GSE135779), just a small subset (GSM4029923, GSM4029911, GSM4029917, GSM4029932, GSM4029939) of about 32K cells.
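The divergent values can be read directly off the reduction (a minimal sketch; obj is an illustrative name for my Seurat object):

```r
# Scores of the first few cells on PC_1 and PC_2; these are the
# values that differ between the two workstations.
head(Embeddings(obj, reduction = "pca")[, 1:2])

# Per-PC standard deviations, for a quick summary comparison.
Stdev(obj, reduction = "pca")[1:5]
```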
For clarity, here are the results obtained on the first workstation:
[screenshot: PC_1/PC_2 values, workstation 1]
And here are the results from the second workstation:
[screenshot: PC_1/PC_2 values, workstation 2]
Thank you!
Hi @jmgs7, thanks for sharing the objects. Can you also share the objects you get without invoking vars.to.regress? It seems the residuals are slightly different between the objects (but I am unable to reproduce this on my end).
Dear @saketkc,
I would like to let you know that we have repeated the analysis using the standard normalization pipeline instead of SCTransform, and all the reproducibility issues have been resolved. We obtain the same results across multiple devices, so it now seems clear that the reproducibility issues originate in SCTransform.
Although we prefer the clustering resolution we obtain with SCTransform, we are going to switch to the standard pipeline for now (we use log(CPM + 1), as do other tools such as Scanpy); a sketch follows below.
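For completeness, the standard pipeline we have switched to looks roughly like this (a sketch; scale.factor = 1e6 gives log(CPM + 1) rather than Seurat's default counts-per-10k):

```r
obj <- NormalizeData(obj, normalization.method = "LogNormalize",
                     scale.factor = 1e6)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj, npcs = 30)
```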
Do not hesitate to ask for any further checks or comparisons you may need to track down the origin of this variability in SCTransform.
We are deeply grateful for your time, and we hope to help in whatever way we can to solve this issue.
Best regards,
JM.