seurat icon indicating copy to clipboard operation
seurat copied to clipboard

Reproducibility issues in Seurat.

Open jmgs7 opened this issue 4 months ago • 17 comments

Dear Seurat team,

Recently, our team has encountered significant reproducibility challenges while using Seurat. When applying the same code to identical datasets on different devices, we observe substantial variations in the results.

Upon investigation, we have identified that the root cause lies in the Principal Component Analysis (PCA) output. Specifically:

  1. PC Composition Differences: When analyzing the 30 principal components (PCs) used for subsequent clustering, we notice discrepancies in the composition of each PC. Genes associated with a particular PC differ between runs, leading to inconsistent results.

  2. PC Score Inversions: The most common issue is value inversions in PC scores. Genes that exhibit positive scores on one computer may have negative scores on another, and vice versa. These variations in PC scores propagate throughout the entire analysis pipeline, significantly impacting the final results.

Furthermore, even when re-running the analysis on the same computer, we encounter run-to-run variability, indicating that the issue persists locally.

Example Scenario:

  • Clustering: We observe differences in the clustering results.
  • Differential Expression Analyses: Variability affects comparisons both between clusters and within different cell types within a single cluster.

Environment Details:

  • We are using the latest version of R, and all relevant packages are up-to-date.
  • Our analysis was tested on three different computers: two running Linux and one on Windows.
  • Attempts to enforce reproducibility using methods like set.seed() and similar functions did not yield consistent results.

Provided Examples:

  • HTML Notebooks: We have attached HTML notebooks showcasing our analysis on two devices:
    • Device A: Ubuntu 22.04.
    • Device B: Windows 11.
  • panc8 Test Data: We also reproduced the issue using the panc8 test data, and the corresponding script is included.

Seurat_reproducibility_issues.zip

We kindly request your expertise in addressing this challenge. Reliable results are crucial for our ongoing analysis, and we appreciate any insights or solutions you can provide.

Thank you for your attention, and we look forward to your guidance.

Best regards,

JM.

jmgs7 avatar Feb 19 '24 17:02 jmgs7

I am having the same issue, where setting a seed and rerunning the same code on the same computer with a fixed R environment can randomly generate variable UMAPs.

mehc555 avatar Feb 21 '24 13:02 mehc555

I have observed this issue as well. Running the same script on different PCs with the same package versions results in nearly inverted or slight alterations in the UMAPs, which would make sense with the PCA issue.

ChristopherStephens21 avatar Feb 22 '24 01:02 ChristopherStephens21

It is also mentioned at #8342 which is very similar of what I have observed.

jmgs7 avatar Feb 23 '24 12:02 jmgs7

Hi @jmgs7, thanks for providing the scripts. Can you rerun them outputting your sessionInfo() at the end?

saketkc avatar Feb 23 '24 22:02 saketkc

Hi @jmgs7, can you also additionally output:

sessionInfo()
RNGkind()

Thanks!

saketkc avatar Feb 24 '24 01:02 saketkc

Dear @saketkc,

Thank you very much for your reply and sorry for the delay.

I attach the same folder with the updated scripts. I labeled the old html notebooks as _1 and the new ones as _2 (those are the ones with the information you requested) in order to check for run-to-run variability, and I actually I found one. In the test with our data run on the Device B (Run_with_our_data_DeviceB) you can see differences in the vulcan plot at the end of the document.

Seurat_reproducibility_issues.zip

We really appreciate your attention, we hope to hear about this issue soon.

Best regards,

JM.

jmgs7 avatar Feb 27 '24 08:02 jmgs7

Hi @jmgs7, thanks for sharing the sessioninfo. Can you install latest Seurat (v5.0.2) and share the files again? I have not been able to reproduce this at my end yet.

saketkc avatar Mar 04 '24 02:03 saketkc

Sure, there you have it. I added the new files as filename_v5.0.2.

Seurat_reproducibility_issues.zip

Thank you very much for your time.

jmgs7 avatar Mar 04 '24 08:03 jmgs7

Dear @saketkc,

Have you been able to reproduce the issues I encountered?

Thank you.

jmgs7 avatar Mar 18 '24 08:03 jmgs7

Unfortunately no. Can you confirm if you notice the same issue without 'vars.to.regress'

saketkc avatar Mar 18 '24 09:03 saketkc

hi @jmgs7, would you be able to share the objects with me post you run SCTransform? You can save it just after the SCTransform step.

saketkc avatar Mar 22 '24 17:03 saketkc

Dear @saketkc,

Sorry for the delay in my answer, but I have been out for holidays.

Sure, I will send you the rds objects for both the test I am running from both devices.

Because the files are too big, I will send a download link to your institutional email from NY Genome Cente.

Thank you again for your time,

Best regards.

jmgs7 avatar Apr 02 '24 09:04 jmgs7

Hi all,

I'm also having consistency issues with my scripts. The scripts haven't changed since I last ran them back in January, but are now producing different UMAPs. If any more information on this issue comes to light, please keep us updated.

Thank you!

luxhexmaster avatar Apr 08 '24 15:04 luxhexmaster

Dear @saketkc,

Have you received our email?

Please, confirm you got access to our data to resend it if you have encountered any problem.

Best regards,

JM.

jmgs7 avatar Apr 10 '24 11:04 jmgs7

Hello everyone,

I'm encountering the same issue while using the same virtual environment controlled with Conda. Specifically, I'm utilizing the v5 version. I've executed the identical code on two separate workstations: one running Ubuntu 22.04.3 LTS and the other Ubuntu 18.04.5 LTS. Like some others who have reported similar problems, it appears that the issue lies within the RunPCA function. Notably, I'm observing divergent values for the first two Principal Components (PCs). I'm using published data from GEO (GSE135779), just an small subset (GSM4029923, GSM4029911, GSM4029917, GSM4029932, GSM4029939) that has 32K cells.

For clarity, here are the results obtained on the first workstation:

PCA_batch

And here are the results from the second workstation:

PCA_batch

Thank you!

ralodo93 avatar Apr 23 '24 06:04 ralodo93

Hi @jmgs7, thanks for sharing the objects. Can you also share the objects you get without invoking vars.to.regress. It seems the residuals are slightly different between the objects (but I am unable to reproduce this at my end)

saketkc avatar Apr 23 '24 11:04 saketkc

Dear @saketkc,

I would like to make you know that we have repeated the analysis but using the standard normalization pipeline instead of SCTransform, and all the reproducibility issues have been solved. We obtain the same results across multiple devices, so it seems that definitively the reproducibility issues come from this function.

Despite we like more the clustering resolution that we obtain when using SCTransform, we are going to switch to the standard pipeline for now (we use log(CPM +1) as other tools such as scranpy).

Do not hesitate to ask for any further checks or comparison you may need to find out the origin of this variability in SCTransform.

We are deeply thankful for your time and we hope to help you in whatever way possible to solve this issue.

Best regards,

JM.

jmgs7 avatar May 03 '24 10:05 jmgs7