TCGAbiolinks icon indicating copy to clipboard operation
TCGAbiolinks copied to clipboard

Bug in TCGAanalyze_Normalization function?

Open jacorvar opened this issue 4 years ago • 1 comments

Dear all,

I have downloaded HTSeq-Counts harmonized data for Lung Adenocarcinoma. Raw counts look like this:

                TCGA-44-7671-01A-11R-2066-07 TCGA-49-6744-11A-01R-1858-07 TCGA-05-4432-01A-01R-1206-07
ENSG00000207629                            0                            0                            0
ENSG00000207644                            0                            0                            0
ENSG00000207699                            0                            0                            0
ENSG00000207706                            0                            0                            0
ENSG00000207711                            0                            0                            0
ENSG00000207722                            0                            0                            0

When I normalized it with TCGAanalyze_Normalization function using geneLength, I get the following:

                TCGA-44-7671-01A-11R-2066-07 TCGA-49-6744-11A-01R-1858-07 TCGA-05-4432-01A-01R-1206-07
ENSG00000207629                          594                          177                          567
ENSG00000207644                          286                          400                          277
ENSG00000207699                         1620                         1280                         4743
ENSG00000207706                          132                          162                          156
ENSG00000207711                          482                          208                          353
ENSG00000207722                         1806                          578                         2794

I do not know what happens, but I suspect such results cannot be right. From zero counts to hundreds/thousands sounds like a bug.

Here my code:

rootdir <- '../'
datadir <- file.path(rootdir, 'data')
gdcdir <- file.path(datadir, 'gdcdir')
tumor <- 'luad'
project <- paste0('TCGA-', toupper(tumor))
query.exp.rna <- GDCquery(project = project, 
                          legacy = F,
                          data.category = "Transcriptome Profiling", 
                          experimental.strategy = 'RNA-Seq',
                          data.type = "Gene Expression Quantification",
                          workflow.type = 'HTSeq - counts',
                          sample.type = c("Primary Tumor","Solid Tissue Normal"))
GDCdownload(query.exp.rna, directory = gdcdir)
data.exp.rna <- GDCprepare(query = query.exp.rna,
                    save = TRUE, 
                    save.filename = file.path(datadir, paste(tumor, "DNAexp.rna.harmonized.rda", sep = '.')),
                    summarizedExperiment = TRUE, directory = gdcdir)

expfile <- file.path(datadir, 'luad.DNAexp.rna.harmonized.rda') 
expData <- attach(expfile)$data
expRaw <- subset(x = expData, select = !colData(expData)$is_ffpe)
expPrep <- TCGAanalyze_Preprocessing(object = expRaw, 
                                      cor.cut = 0.7,
                                      datatype = "HTSeq - Counts")                      

expNorm <- TCGAanalyze_Normalization(tabDF = expPrep,
                                       geneInfo = geneInfoHT,
                                       method = "geneLength") 

jacorvar avatar May 10 '20 16:05 jacorvar

@torongs82 Have you already had a look at this issue?

jacorvar avatar Oct 13 '20 17:10 jacorvar