Questions about sc.pp.scale and sc.tl.pca

Open HelloWorldLTY opened this issue 2 years ago • 4 comments

Hi,

I found that some tutorials do not use sc.pp.scale before sc.tl.pca, for example: https://scanpy-tutorials.readthedocs.io/en/latest/integrating-data-using-ingest.html

But other tutorials do use it: https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html

The same inconsistency appears in the analysis code of several tools: https://docs.scvi-tools.org/en/stable/tutorials/notebooks/api_overview.html

Therefore, I wonder whether sc.pp.scale is a key step before PCA or not. Thanks.

HelloWorldLTY avatar Mar 03 '22 06:03 HelloWorldLTY

Hello, I have the same question. I think that after sc.pp.scale, each gene (i.e., each variable) is standardized to mean 0 and variance 1, and that this is an ideal data model for PCA. So I would like to know whether sc.pp.scale is an important step before sc.tl.pca, for the reason I guessed above.

MugenQin avatar Apr 20 '22 09:04 MugenQin

Hi both,

I think I discuss this briefly in the current best practices paper (note: not so current anymore).

TL;DR: there is no consensus in the field on whether to scale or not. sc.tl.pca will zero-center all genes so that the first PC doesn't just capture mean variation, but scaling goes beyond that. I guess this is a general question of whether you would like all genes to contribute equally to the PC embedding, or whether you are okay with the embedding being driven somewhat by the variance of each gene (and therefore by its mean, due to the mean-variance relationship in scRNA-seq data).
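To make the trade-off above concrete, here is a minimal NumPy sketch (not scanpy code). It assumes two hypothetical "genes" that carry the same underlying signal at very different magnitudes, and compares the first principal component with and without unit-variance scaling. The data, the `first_pc` helper, and the noise levels are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 500

# Hypothetical data: one shared signal z drives both genes,
# but gene_high has ~10x the standard deviation of gene_low.
z = rng.normal(size=n_cells)
gene_high = 10.0 + 5.0 * (z + 0.5 * rng.normal(size=n_cells))
gene_low = 1.0 + 0.5 * (z + 0.5 * rng.normal(size=n_cells))
X = np.column_stack([gene_high, gene_low])


def first_pc(X):
    """Return the loadings of the first principal component (centered PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]


# Without scaling: PC1 is driven almost entirely by the high-variance gene.
pc1_unscaled = first_pc(X)

# With scaling to unit variance: both genes contribute equally to PC1.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pc1_scaled = first_pc(X_scaled)

print("PC1 loadings, unscaled:", np.abs(pc1_unscaled))
print("PC1 loadings, scaled:  ", np.abs(pc1_scaled))
```

Neither result is "wrong": the unscaled version weights genes by their variance, while the scaled version weights them equally, which is exactly the choice described above.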

LuckyMD avatar Apr 20 '22 09:04 LuckyMD

Hi, thanks for your ideas and discussion. For me, I think scaling is necessary, because if the data is not centered at 0, the plane we find based on the covariance matrix may not be the optimal one. I think the PCA optimization only works for zero-centered data.

HelloWorldLTY avatar Apr 20 '22 11:04 HelloWorldLTY

As mentioned above, zero-centering is generally done in sc.tl.pca. So the data will be zero-centered before PCA runs, both with and without scaling.
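The role of centering can be sketched with plain NumPy (this is an illustration of the general idea, not scanpy's implementation). Assuming hypothetical Poisson counts for three genes, a PCA that centers internally gives the same components whether or not the input was centered beforehand, whereas an uncentered decomposition has a first component that mostly points along the mean expression vector. The `pca_components` helper is made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical count data: 300 cells x 3 genes with different mean expression.
X = rng.poisson(lam=[20.0, 5.0, 1.0], size=(300, 3)).astype(float)


def pca_components(X, center=True):
    """Principal axes via SVD; optionally subtract per-gene means first."""
    Xc = X - X.mean(axis=0) if center else X
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt


# PCA that centers internally: pre-centering the input changes nothing
# (components agree up to sign).
from_raw = pca_components(X)
from_precentered = pca_components(X - X.mean(axis=0))
print(np.allclose(np.abs(from_raw), np.abs(from_precentered)))

# Without any centering, the first component mostly tracks the mean vector.
uncentered = pca_components(X, center=False)
mean_direction = X.mean(axis=0) / np.linalg.norm(X.mean(axis=0))
alignment = abs(uncentered[0] @ mean_direction)
print("alignment of uncentered PC1 with the mean direction:", alignment)
```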

LuckyMD avatar Apr 20 '22 11:04 LuckyMD

The help documentation of sc.pp.scale says: "zero_center: If False, omit zero-centering variables, which allows to handle sparse input efficiently." I am still confused about zero_center. If zero_center=False, what will sc.pp.scale do? Could you give a simple example? For example, [1,2,3] would be [-1.22,0,1.22] after scaling, but what if zero_center=False?

wangjiawen2013 avatar Oct 27 '22 08:10 wangjiawen2013

The data will only be divided by the standard deviations; the means won't be subtracted.
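A small NumPy sketch of the [1, 2, 3] example from the question (this mimics the idea, it does not call sc.pp.scale). The `scale_column` helper and its `ddof` parameter are hypothetical; `ddof=0` (population standard deviation) is assumed here because it reproduces the numbers in the question. scanpy's own variance estimator may use a different convention, so exact values from sc.pp.scale can differ.

```python
import numpy as np


def scale_column(x, zero_center=True, ddof=0):
    """Scale one variable to unit variance, optionally subtracting its mean first."""
    x = np.asarray(x, dtype=float)
    std = x.std(ddof=ddof)
    if zero_center:
        return (x - x.mean()) / std
    # Only divide: zeros stay zero, so a sparse matrix can stay sparse.
    return x / std


centered = scale_column([1, 2, 3], zero_center=True)
not_centered = scale_column([1, 2, 3], zero_center=False)

print(centered)      # approximately [-1.22, 0.00, 1.22]
print(not_centered)  # approximately [ 1.22, 2.45, 3.67]
```

Because dividing by the standard deviation leaves zero entries at zero, skipping the mean subtraction is what lets sparse input be handled efficiently.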

potulabe avatar Jan 04 '23 09:01 potulabe

We will close the issue for now, as it appears the questions herein have been addressed :)

However, please don't hesitate to reopen this issue or create a new one if you have any more questions or run into any related problems in the future.

Thanks for being a part of our community! :)

eroell avatar Oct 12 '23 09:10 eroell