TCGAbiolinks
GDCprepare with duplicated samples error
Hello!
I'm trying to download data from TCGA-UCEC, and I'm getting an error about duplicated samples. When I run my query:
query <- GDCquery(project = "TCGA-UCEC",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  experimental.strategy = "RNA-Seq",
                  platform = c("Illumina HiSeq", "Illumina GA"),
                  file.type = "results",
                  legacy = TRUE)
I get the following output, which includes a warning about duplicated cases:
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg19
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-UCEC
--------------------
oo Filtering results
--------------------
ooo By platform
ooo By experimental.strategy
ooo By data.type
ooo By file.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
I didn't think it was a big deal until I ran the following code:
GDCdownload(query, files.per.chunk = 100)
# Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
# rsem.genes.results as values
UCECRnaseqSE <- GDCprepare(query,
                           save = TRUE,
                           save.filename = "query-UCEC.rda",
                           summarizedExperiment = TRUE)
And got this error:
| |tags |cases |experimental_strategy |
|:---|:-------------------------------|:----------------------------|:---------------------|
|437 |c("unnormalized", "gene", "v2") |TCGA-AX-A1C7-01A-11R-A137-07 |RNA-Seq |
|545 |c("unnormalized", "gene", "v2") |TCGA-AX-A1C7-01A-11R-A137-07 |RNA-Seq |
Error in GDCprepare(query, save = TRUE, save.filename = "query-UCEC.rda", :
There are samples duplicated. We will not be able to prepare it
Is there any way I can solve this? I tried deleting the repeated columns but I couldn't do it.
Thanks!
I have the same error as you.
I solved the problem by removing the redundant cases with the following code:
query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
                           data.category = "Gene expression",
                           data.type = "Isoform expression quantification",
                           platform = "Illumina HiSeq",
                           legacy = TRUE)
# Work on a copy of the query and keep only the first file listed for each case
query.exp.hg19.2 <- query.exp.hg19
tmp <- query.exp.hg19.2$results[[1]]
tmp <- tmp[!duplicated(tmp$cases), ]
query.exp.hg19.2$results[[1]] <- tmp
# Download and prepare using the deduplicated query
GDCdownload(query.exp.hg19.2)
gbm_exp <- GDCprepare(query.exp.hg19.2)
The problem is that the same sample was run on two platforms (Illumina GA and Illumina HiSeq).
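To see which barcodes are affected before removing anything, the results table can be filtered for repeated `cases` values. The sketch below is only an illustration on a mock data frame standing in for `getResults(query)`: the `platform` column name, the platform assigned to each row, and the `TCGA-XX-0000-...` barcode are assumptions made for the example, not values taken from the real query.

```r
# Mock stand-in for getResults(query). Column names and the second barcode
# are hypothetical; only the first barcode appears in the error above.
results <- data.frame(
  cases = c("TCGA-AX-A1C7-01A-11R-A137-07",
            "TCGA-AX-A1C7-01A-11R-A137-07",
            "TCGA-XX-0000-01A-11R-0000-07"),
  platform = c("Illumina GA", "Illumina HiSeq", "Illumina HiSeq"),
  stringsAsFactors = FALSE
)

# Barcodes that occur more than once in the table
dup_cases <- unique(results$cases[duplicated(results$cases)])

# Every row for those barcodes, so all platforms involved are visible
dup_rows <- results[results$cases %in% dup_cases, ]
print(dup_rows)
```

With a real query, the same filter would be applied to the data frame returned by `getResults(query)`, using whatever column actually holds the platform.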
The package does not handle duplicated samples in the prepare step, because that would require changing sample names, and I would prefer not to do that automatically. Users have to look closely at the data in those cases and decide what the best solution is. Normally we remove duplicates from our TCGA analyses. Since this type of scenario should not happen in the harmonized version of the database, I prefer not to handle it.
The only way to solve this is to remove the duplicated sample manually.
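One way to make the manual removal explicit is to decide which platform to keep when a barcode appears twice, rather than keeping whichever row happens to come first. The following is a minimal sketch on a mock results table; the `platform` column name, the preference for Illumina HiSeq over Illumina GA, and the `TCGA-XX-0000-...` barcode are all assumptions for illustration, not a recommendation from the package.

```r
# Mock stand-in for query$results[[1]]; column names and the second
# barcode are hypothetical.
results <- data.frame(
  cases = c("TCGA-AX-A1C7-01A-11R-A137-07",
            "TCGA-AX-A1C7-01A-11R-A137-07",
            "TCGA-XX-0000-01A-11R-0000-07"),
  platform = c("Illumina GA", "Illumina HiSeq", "Illumina HiSeq"),
  stringsAsFactors = FALSE
)

# Assumed preference: keep HiSeq when a sample exists on both platforms
platform_preference <- c("Illumina HiSeq", "Illumina GA")

# Sort rows so the preferred platform comes first, then keep the first
# row seen for each barcode
ranked <- results[order(match(results$platform, platform_preference)), ]
deduplicated <- ranked[!duplicated(ranked$cases), ]

# Sanity check: every barcode now appears exactly once
stopifnot(anyDuplicated(deduplicated$cases) == 0)
print(deduplicated)
```

With a real query, the deduplicated table would be assigned back to `query$results[[1]]`, as in the earlier reply, before calling `GDCdownload` and `GDCprepare`.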
@tiagochst @osj118 I've tried running your code to eliminate the duplicated samples, but now I'm getting this error:
Error in Ops.data.frame(y[, 1], ret[[1]][, 1]) : ‘==’ only defined for equally-sized data frames
I am very new to this. Is there a way I can eliminate the duplicates? Thank you in advance. :)