scanpy
Questions about sc.pp.scale and sc.tl.pca
Hi,
I noticed that some tutorials do not use sc.pp.scale before sc.tl.pca: https://scanpy-tutorials.readthedocs.io/en/latest/integrating-data-using-ingest.html
Others do use it: https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html
The same inconsistency appears in the analysis code of several tools: https://docs.scvi-tools.org/en/stable/tutorials/notebooks/api_overview.html
Therefore, I wonder whether the scale function is a key step before PCA or not. Thanks.
Hello, I have the same question. My understanding is that after sc.pp.scale, each gene (i.e. each variable) has mean 0 and variance 1, and that this is the ideal data model for PCA. So I would like to know whether sc.pp.scale is an important step before sc.tl.pca for the reason I guessed above.
Hi both,
I think I discuss this briefly in the current best practices paper (note: not so current anymore).
TL;DR: there is no consensus in the field on whether to scale or not. sc.tl.pca will zero-center all genes so that the first PC doesn't just capture mean variation, but scaling goes beyond that. I guess this is a general question of whether you would like all genes to contribute equally to the PC embedding, or whether you are okay with the embedding being driven somewhat by the variance of each gene (and therefore by its mean, due to the mean-variance relationship in scRNA-seq data).
Hi, thanks for your ideas and discussion. I think scaling is necessary, because if the data is not centered at 0, the plane we find based on the covariance matrix may not be the optimal one. As far as I understand, the PCA optimization only works for zero-centered data.
As mentioned above, zero-centering is generally done in sc.tl.pca. So the data will be zero-centered before PCA is run, both with and without scaling.
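To make the distinction between centering and scaling concrete, here is a minimal NumPy sketch. It is not scanpy's implementation: `pca_scores` is a hypothetical helper that centers each column and then takes an SVD, and the toy matrix is made up so that one "gene" has a much larger variance than the others.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Hypothetical helper: PCA via SVD that centers each column first."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Scores (cells x components) and loadings (components x genes)
    return U[:, :n_components] * S[:n_components], Vt[:n_components]

rng = np.random.default_rng(0)
# Toy "cells x genes" matrix: gene 0 has ~10x the standard deviation of the rest
X = rng.normal(size=(200, 5)) * np.array([10.0, 1.0, 1.0, 1.0, 1.0]) + 5.0

# Centering beforehand changes nothing, because the PCA centers anyway
scores_raw, loadings_raw = pca_scores(X)
scores_precentered, _ = pca_scores(X - X.mean(axis=0))
same_after_centering = np.allclose(np.abs(scores_raw), np.abs(scores_precentered))

# Scaling does change the input: every gene now has unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
share_raw = X.var(axis=0) / X.var(axis=0).sum()
share_scaled = X_scaled.var(axis=0) / X_scaled.var(axis=0).sum()

# Without scaling, PC1 points almost entirely along the high-variance gene
pc1_weight_raw = abs(loadings_raw[0, 0])

print(same_after_centering)
print(np.round(share_raw, 2), np.round(share_scaled, 2))
print(round(pc1_weight_raw, 3))
```

In this toy case, pre-centering gives the same embedding (up to sign), while scaling takes gene 0 from carrying most of the total variance to carrying exactly one fifth of it. That is the "should all genes contribute equally" choice described above.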
The help documentation of sc.pp.scale says: "zero_center: If False, omit zero-centering variables, which allows to handle sparse input efficiently."
I am still confused about zero_center. If zero_center=False, what will sc.pp.scale do? Could you give a simple example? For example, [1,2,3] would be [-1.22,0,1.22] after scaling, but what happens if zero_center=False?
With zero_center=False, the data will only be scaled by the standard deviations; the means won't be subtracted.
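For the [1,2,3] example, a small NumPy sketch of that behaviour could look like the following. `scale_columns` is a hypothetical stand-in, not scanpy's code. The `ddof=0` default (population standard deviation) is an assumption chosen only because it reproduces the [-1.22,0,1.22] from the question; a library may instead divide by the sample standard deviation (`ddof=1`), which would give different numbers.

```python
import numpy as np

def scale_columns(X, zero_center=True, ddof=0):
    """Hypothetical stand-in: divide each column by its std, optionally centering first.

    ddof=0 is an assumption made to match the numbers in the question.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=ddof)  # std is still computed around the mean
    std[std == 0] = 1.0             # avoid dividing by zero for constant columns
    if zero_center:
        X = X - mean
    return X / std

x = np.array([[1.0], [2.0], [3.0]])  # one "gene" measured in three "cells"

centered = scale_columns(x, zero_center=True).ravel()
uncentered = scale_columns(x, zero_center=False).ravel()

print(np.round(centered, 2))    # approx. -1.22, 0.00, 1.22
print(np.round(uncentered, 2))  # approx.  1.22, 2.45, 3.67
```

Without centering, zeros stay zeros (0 / std is still 0), which is why this variant can keep a sparse matrix sparse.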
We will close the issue for now, as it appears the questions herein have been addressed :)
However, please don't hesitate to reopen this issue or create a new one if you have any more questions or run into any related problems in the future.
Thanks for being a part of our community! :)