scRNA.seq.datasets
scRNA.seq.datasets copied to clipboard
R scripts don't produce same RDS files
I've noticed that downloading the raw data and running the provided R scripts for the data sets to generate the RDS files outputs different results than what the ones on the website. Specifically, the logcounts
don't match up everywhere.
I discovered this while looking through the scmap paper and going through the data sets. This problem only occurs for the data sets with CPM normalization (in Supplementary Table 2 from scmap) namely: Goolam, Li, Kolodziejczyk, Baron, Segerstolpe, Klein, Zeisel, Shekhar and Macosko.
I've gone through how the logcounts
are computed in create_sce.R
, but it doesn't match up with the actual results. For example, take the Li data set
> log2(calculateCPM(sceset, use_size_factors = FALSE) + 1)[1:5, 1:2]
RHA015__A549__turquoise RHA016__A549__turquoise
TSPAN6 9.009166 9.2430517
TNMD 0.000000 0.0000000
DPM1 4.888420 6.9251494
SCYL3 0.000000 6.7888400
C1orf112 0.000000 0.6565277
while loading li.rds
available at https://hemberg-lab.github.io/scRNA.seq.datasets/human/tissues/
gives
> logcounts(li)[1:5, 1:2]
RHA015__A549__turquoise RHA016__A549__turquoise
TSPAN6 10.352333 10.586473
TNMD 0.000000 0.000000
DPM1 6.203446 8.262799
SCYL3 0.000000 8.125773
C1orf112 0.000000 1.300885
How were these logcounts
computed?
Hi Pavlin, thanks for your question. All of the datasets were created from the scripts provided in the repository. In fact, this was done automatically, I wasn't involved. The only reason for this being different I see that it maybe a new version of R/Bioconductor/scater/limma package, in which calculateCPM
function has been updated. calculateCPM
I think is from the limma
package. Last time the datasets on our website were generated on February 27, 2018, which was quite a long time ago, many things have been updated since then. So, my recommendation would be to use the scripts with the newest versions of all of the packages. You will have slightly different numbers, but I am sure the final results won't change dramatically. Hope this helps.
I am experiencing the same issue. I don't think it's the calculateCPM
function (which comes from scater). This is a pretty straightforward transformation (equivalent to t(t(counts)*1e6/colSums(counts))
) and I can't find any record of it changing in the scater
documentation.
All of the datasets were created from the scripts provided in the repository. In fact, this was done automatically, I wasn't involved.
I'm unable to run your scripts without the changes in https://github.com/ForrestCKoch/scRNA.seq.datasets/commit/09074c19fe24ba609dcfee4cea20cbb16d45d1d4. Are these the same scripts you used?