Reproducibility issue
Hi, I'm using the standard Seurat pipeline (NormalizeData -> ScaleData -> RunPCA -> RunUMAP) on my data, with all parameters set to their defaults except for RunUMAP(dims = 1:50). However, I found that the UMAP outputs for the same dataset were inconsistent across different devices. I have been using a Docker image (lyc1995/bioinfo:1.0.3 on Docker Hub) that I built myself to keep the environment identical. Is there a way to make the analysis exactly reproducible?
Thank you!
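For reference, here is a minimal sketch of what I'm running (names are illustrative; `counts` stands for my raw count matrix, and FindVariableFeatures() is included because RunPCA() operates on the variable features):

```r
library(Seurat)

# `counts` is assumed to be a raw gene-by-cell count matrix
obj <- CreateSeuratObject(counts = counts)
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj)
obj <- ScaleData(obj)
obj <- RunPCA(obj, npcs = 50)      # seed.use = 42 by default
obj <- RunUMAP(obj, dims = 1:50)   # seed.use = 42 by default
```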
Different embedding seed perhaps? If you have the same uwot, Seurat or umap-learn then that is the only thing I can think of.
Hi, Kristof. I don't think it is a seed issue. Both RunPCA() and RunUMAP() set the seed to 42 by default, and I kept those defaults. Actually, I think it may be an irlba issue, which was mentioned in https://github.com/bwlewis/irlba/issues/61#issuecomment-974781327.
I can possibly endorse @lyc-1995, because at first I also thought this was a random-number issue behind the approximate SVD of irlba. However, I compared the output of RunPCA() in three different ways, with the same random seed:
- Two replicate runs using approx = TRUE.
- Two replicate runs using approx = FALSE.
- Two replicate runs using approx = TRUE, but feeding v from the first irlba output into the second run (in fact, if we do not set the same random seed across two runs, this is also a way to get a consistent SVD across runs).
My assumption was that if only the two runs in the first case showed an inconsistent SVD, then it would be due to the randomness of irlba. However, with set.seed(42), all six SVDs were exactly the same, checked with which(svd.1$u != svd.2$u) and which(svd.1$v != svd.2$v).
I also thought this might be a numerical precision issue across devices running on different underlying infrastructure.
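For clarity, a rough sketch of the within-machine comparison described above (assuming `mat` is the scaled features-by-cells matrix; not the exact code used):

```r
library(irlba)

set.seed(42)
svd.1 <- irlba(t(mat), nv = 50)
set.seed(42)
svd.2 <- irlba(t(mat), nv = 50)

# with the same seed, repeated runs on one machine match element-wise
which(svd.1$u != svd.2$u)   # integer(0)
which(svd.1$v != svd.2$v)   # integer(0)
```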
You may find that irlba returns very slightly different results across different devices due to differences in numerical precision, and this will affect UMAP results.
@lyc-1995 can you check that the PCA embeddings are identical across the different devices?
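One way to check this (a sketch; the file names are just examples): save the PCA embeddings on each device and compare them on a single machine.

```r
# on each device, after RunPCA():
emb <- Embeddings(obj, reduction = "pca")
saveRDS(emb, "pca_embeddings_deviceA.rds")   # example file name

# then on one machine, compare the two saved matrices:
emb.a <- readRDS("pca_embeddings_deviceA.rds")
emb.b <- readRDS("pca_embeddings_deviceB.rds")
identical(emb.a, emb.b)
all.equal(emb.a, emb.b)   # reports the size of any numerical differences
```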
This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.
Hi, @timoast
Sorry for the long delay!
It seems that the different UMAP embeddings are derived from different PCA outputs, which are caused by irlba::irlba(). I replaced it with RSpectra::svds() and that fixed my issue. Now I can get identical UMAPs from different devices!
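A minimal sketch of the swap (assuming `mat` is the scaled features-by-cells matrix used for PCA; this mirrors the idea rather than my exact modification):

```r
library(RSpectra)

npcs <- 50
svd.res <- RSpectra::svds(t(mat), k = npcs)         # truncated SVD, in place of irlba::irlba()
feature.loadings <- svd.res$v                       # gene loadings
cell.embeddings  <- svd.res$u %*% diag(svd.res$d)   # PC scores for each cell
rownames(feature.loadings) <- rownames(mat)
rownames(cell.embeddings)  <- colnames(mat)
```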
I have created a GitHub repo for testing the reproducibility of the Seurat pipeline (https://github.com/lyc-1995/seurat_reproducible). You can find the details in my Jupyter notebook.
Also, would you consider adding RSpectra::svds() as an alternative method in RunPCA()?
Thank you!
There must have been some mistake when converting the .md to PDF, and all the cell outputs were lost! I regenerated the .png files from the original notebook.
Hi @lyc-1995, thanks for looking into this. irlba is quite a bit faster than RSpectra (at least according to this), but in some cases I'm sure reproducibility can be more important to users than speed. I'll try to add an option to run with RSpectra rather than irlba. You can also run with approx = FALSE to use prcomp rather than irlba, which I would assume also gives reproducible results across machines, but will be substantially slower.
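For example (a sketch, using the parameters from the original post):

```r
obj <- RunPCA(obj, npcs = 50, approx = FALSE)   # uses prcomp instead of irlba; slower but deterministic
obj <- RunUMAP(obj, dims = 1:50)
```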
Hi @timoast, I found another potential source of non-reproducibility: the variable features output by FindVariableFeatures() may be in different orders on different machines, even when the feature sets are identical. I didn't explore further whether the output of RunPCA() could differ when scale.data has a different row order, but simply forcing the order of VariableFeatures() or of the row names of scale.data may be an easy way to further guarantee reproducibility.
In my previously modified RunPCA(), I set object <- object[order(rownames(x = object)), ] before calculating the SVD. You can get the code from R/modify_seurat.R in https://github.com/lyc-1995/seurat_reproducible.
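An equivalent sketch of that idea without modifying RunPCA() itself, by fixing the feature order up front (illustrative, not the code from the repo):

```r
# sort the variable features so their order is identical on every machine
VariableFeatures(obj) <- sort(VariableFeatures(obj))
obj <- ScaleData(obj, features = VariableFeatures(obj))
obj <- RunPCA(obj, features = VariableFeatures(obj), npcs = 50)
```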
Below is the pdf showing the results of my test.
run_std_comparison-Copy1.pdf
Hi @lyc-1995, I also encountered the non-reproducibility of variable features. Each time I run FindVariableFeatures(), the feature sets are not identical. Can you give me some advice or help?
Thanks for this great package. Similar to @lyc-1995, I am trying to set up a Singularity instance when I park my project, to ensure my collaborators can revise the analysis during the publication process if necessary. But it doesn't give the same results as my original analysis, and it gives different results when run on different machines. I have also encountered this reproducibility issue with the same modules reported by @lyc-1995, i.e. the UMAP construction and FindVariableFeatures(). The lack of reproducibility in the UMAP is particularly nasty because it impacts the downstream analysis; specifically, a different number of clusters can be determined on the new UMAP and/or they can be in a different order. Is it possible to set the seed in irlba to avoid this? Setting the seed at the UMAP construction level doesn't seem to be enough. Similarly, for FindVariableFeatures() I get slightly different results.
While the numerical outputs of PCA will be the same regardless of feature order, the signs of individual components can flip, as these are arbitrary. This does not change downstream results or distance matrices between cells, but could result in a visual UMAP change. There's not a lot we can do about this; we do aim to support reproducibility as much as possible, but as has been discussed in this thread, even minor differences in numerical precision can affect non-linear dimensional reduction algorithms.
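If it helps, a small sketch for checking whether two sets of PCA embeddings differ only by such sign flips (emb.a and emb.b are the cells-by-PCs embedding matrices from two machines; illustrative only):

```r
# +1 where the corresponding PCs point the same way, -1 where the sign is flipped
flip <- sign(colSums(emb.a * emb.b))
emb.b.aligned <- sweep(emb.b, 2, flip, `*`)
all.equal(emb.a, emb.b.aligned)   # TRUE (up to precision) if only the signs differed
```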