GDCprepare with duplicated samples error

Open lucianamontivero opened this issue 6 years ago • 4 comments

Hello!

I'm trying to download data from TCGA-UCEC, and I'm getting an error about duplicated samples. When I run my query:

query <- GDCquery(project = "TCGA-UCEC", 
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = c("Illumina HiSeq", "Illumina GA"),
                  file.type = "results",
                  legacy = TRUE)

I get the following warning:

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg19
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-UCEC
--------------------
oo Filtering results
--------------------
ooo By platform
ooo By experimental.strategy
ooo By data.type
ooo By file.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
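The warning above points at `getResults(query)`, which returns the query's file table as a data frame. A minimal sketch of how one might list the barcodes that occur more than once, using a small mock table in place of the real result. The mock rows and the helper name `find_duplicated_cases` are illustrative only, and the `platform` column is assumed to be present in the real table alongside `cases`:

```r
# Mock stand-in for the query results. With TCGAbiolinks loaded you would use:
# results <- getResults(query)
results <- data.frame(
  cases = c("TCGA-AX-A1C7-01A-11R-A137-07",
            "TCGA-AX-A1C7-01A-11R-A137-07",
            "TCGA-XX-0000-01A-11R-A137-07"),  # made-up barcode for illustration
  platform = c("Illumina GA", "Illumina HiSeq", "Illumina HiSeq"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return every row whose barcode appears more than once
find_duplicated_cases <- function(res) {
  dup_barcodes <- unique(res$cases[duplicated(res$cases)])
  res[res$cases %in% dup_barcodes, , drop = FALSE]
}

dups <- find_duplicated_cases(results)
print(dups)         # all files that share a barcode with another file
table(dups$cases)   # number of files per duplicated barcode
```

Looking at the other columns of the duplicated rows should show what actually differs between the files before deciding which one to drop.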

I didn't think it was a big deal until I ran the following code:

GDCdownload(query, files.per.chunk = 100)

# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
UCECRnaseqSE <- GDCprepare(query,
                           save = TRUE,
                           save.filename = "query-UCEC.rda" ,
                           summarizedExperiment = TRUE)

And got this error:

|    |tags                            |cases                        |experimental_strategy |
|:---|:-------------------------------|:----------------------------|:---------------------|
|437 |c("unnormalized", "gene", "v2") |TCGA-AX-A1C7-01A-11R-A137-07 |RNA-Seq               |
|545 |c("unnormalized", "gene", "v2") |TCGA-AX-A1C7-01A-11R-A137-07 |RNA-Seq               |
Error in GDCprepare(query, save = TRUE, save.filename = "query-UCEC.rda",  : 
  There are samples duplicated. We will not be able to prepare it

Is there any way I can solve this? I tried deleting the repeated columns but I couldn't do it.

Thanks!

lucianamontivero, Jan 24 '19 13:01

I have the same error as you.

weinformatics, Jan 09 '20 05:01

I solved the problem by removing the redundant cases with the following code:

library(TCGAbiolinks)

query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
                           data.category = "Gene expression",
                           data.type = "Isoform expression quantification",
                           platform = "Illumina HiSeq",
                           legacy = TRUE)

# Copy the query and keep only the first file for each case barcode
query.exp.hg19.2 <- query.exp.hg19
tmp <- query.exp.hg19.2$results[[1]]
tmp <- tmp[!duplicated(tmp$cases), ]
query.exp.hg19.2$results[[1]] <- tmp

# Download and prepare using the de-duplicated query
GDCdownload(query.exp.hg19.2)
gbm_exp <- GDCprepare(query.exp.hg19.2)

osj118, Jan 14 '20 08:01

The problem is that the same sample was run on two platforms (Illumina GA and Illumina HiSeq).

The package does not handle duplicated samples in the prepare step, because that would require changing sample names, and I would prefer not to do that automatically. Users have to look closely at the data in those cases and decide what the best solution is. Normally we remove duplicates from our TCGA analyses. Since this type of scenario should not happen in the harmonized version of the database, I prefer not to handle it.

The only way to solve this is to remove the sample manually.
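One way such a manual removal could look, sketched on a mock table rather than a real query: when a barcode appears on both platforms, keep the row from a preferred platform and drop the other. Preferring Illumina HiSeq here is an arbitrary choice for illustration, not a recommendation from the package. The helper name `keep_preferred_platform` and the mock rows are hypothetical, and the `platform` column is assumed to exist in the real results table.

```r
# Mock stand-in for query$results[[1]]; rows are made up for illustration
results <- data.frame(
  cases = c("TCGA-AX-A1C7-01A-11R-A137-07",
            "TCGA-AX-A1C7-01A-11R-A137-07",
            "TCGA-XX-0000-01A-11R-A137-07"),
  platform = c("Illumina GA", "Illumina HiSeq", "Illumina GA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep one row per barcode, favouring `preferred`
keep_preferred_platform <- function(res, preferred = "Illumina HiSeq") {
  # Move rows from the preferred platform to the top of the table
  res <- res[order(res$platform != preferred), , drop = FALSE]
  # duplicated() flags second and later occurrences, so the first
  # (preferred, when available) row for each barcode is the one kept
  res[!duplicated(res$cases), , drop = FALSE]
}

filtered <- keep_preferred_platform(results)
print(filtered)

# With a real query object, the filtered table would be written back
# before downloading and preparing:
# query$results[[1]] <- keep_preferred_platform(query$results[[1]])
```

Barcodes that only exist on the non-preferred platform are kept, so no sample is lost; only the extra file for a duplicated barcode is dropped. Note that the row order of the table changes.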

tiagochst, May 01 '20 13:05

@tiagochst @osj118 I've tried running your code to eliminate the duplicated samples, but now I'm getting this error:

Error in Ops.data.frame(y[, 1], ret[[1]][, 1]) : ‘==’ only defined for equally-sized data frames

I am very new at this. Is there a way I can eliminate the duplicates? Thank you in advance. :)

DarkHe007, Oct 15 '21 17:10