scRNA.seq.datasets icon indicating copy to clipboard operation
scRNA.seq.datasets copied to clipboard

R scripts don't produce same RDS files

Open pavlin-policar opened this issue 6 years ago • 2 comments

I've noticed that downloading the raw data and running the provided R scripts for the data sets to generate the RDS files outputs different results than what the ones on the website. Specifically, the logcounts don't match up everywhere.

I discovered this while looking through the scmap paper and going through the data sets. This problem only occurs for the data sets with CPM normalization (in Supplementary Table 2 from scmap) namely: Goolam, Li, Kolodziejczyk, Baron, Segerstolpe, Klein, Zeisel, Shekhar and Macosko.

I've gone through how the logcounts are computed in create_sce.R, but it doesn't match up with the actual results. For example, take the Li data set

> log2(calculateCPM(sceset, use_size_factors = FALSE) + 1)[1:5, 1:2]
         RHA015__A549__turquoise RHA016__A549__turquoise
TSPAN6                  9.009166               9.2430517
TNMD                    0.000000               0.0000000
DPM1                    4.888420               6.9251494
SCYL3                   0.000000               6.7888400
C1orf112                0.000000               0.6565277

while loading li.rds available at https://hemberg-lab.github.io/scRNA.seq.datasets/human/tissues/ gives

> logcounts(li)[1:5, 1:2]
         RHA015__A549__turquoise RHA016__A549__turquoise
TSPAN6                 10.352333               10.586473
TNMD                    0.000000                0.000000
DPM1                    6.203446                8.262799
SCYL3                   0.000000                8.125773
C1orf112                0.000000                1.300885

How were these logcounts computed?

pavlin-policar avatar Jul 07 '18 16:07 pavlin-policar

Hi Pavlin, thanks for your question. All of the datasets were created from the scripts provided in the repository. In fact, this was done automatically, I wasn't involved. The only reason for this being different I see that it maybe a new version of R/Bioconductor/scater/limma package, in which calculateCPM function has been updated. calculateCPM I think is from the limma package. Last time the datasets on our website were generated on February 27, 2018, which was quite a long time ago, many things have been updated since then. So, my recommendation would be to use the scripts with the newest versions of all of the packages. You will have slightly different numbers, but I am sure the final results won't change dramatically. Hope this helps.

wikiselev avatar Jul 11 '18 09:07 wikiselev

I am experiencing the same issue. I don't think it's the calculateCPM function (which comes from scater). This is a pretty straightforward transformation (equivalent to t(t(counts)*1e6/colSums(counts))) and I can't find any record of it changing in the scater documentation.

All of the datasets were created from the scripts provided in the repository. In fact, this was done automatically, I wasn't involved.

I'm unable to run your scripts without the changes in https://github.com/ForrestCKoch/scRNA.seq.datasets/commit/09074c19fe24ba609dcfee4cea20cbb16d45d1d4. Are these the same scripts you used?

ForrestCKoch avatar Oct 09 '19 00:10 ForrestCKoch