TCGAbiolinks icon indicating copy to clipboard operation
TCGAbiolinks copied to clipboard

Fix duplicates issue when querying expression data

Open ShixiangWang opened this issue 8 months ago • 0 comments

I was trying to obtain STAR gene expression data from multiple datasets including TCGA, TARGET, and CPTAC. Some errors happened when preparing some projects, and they are mainly related to duplicated records, especially in the CPTAC-3. I tested the code and made a workaround. This PR also amended some data checks to make sure the data preparation process went smoothly.

o Preparing output
-------------------
Downloading data for project CPTAC-3
Of the 2340 files for download 2340 already exist.
All samples have been already downloaded
Removing duplicated cases (with older updated time)
 => 41 records removed

Code to query the data.

library(TCGAbiolinks)
#devtools::load_all("/Volumes/EPan/Repos/TCGAbiolinks/")
#install.packages("/Volumes/EPan/Repos/TCGAbiolinks/", repos = NULL, type = "source")

prjs = TCGAbiolinks::getGDCprojects()
tcga_prjs = prjs$project_id[startsWith(prjs$project_id, "TCGA")]
targ_prjs = prjs$project_id[startsWith(prjs$project_id, "TARGET")]
cpta_prjs = prjs$project_id[startsWith(prjs$project_id, "CPTAC")]

# mRNA pipeline: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

dir.create("data")
dir.create("data/count")
dir.create("data/tpm")

# GDC data download and prepare ------------------------------------------------

for (i in c(tcga_prjs, targ_prjs, cpta_prjs)) {
  message("Handling ", i)
  # if (file.exists(glue::glue("data/tpm/{i}.exp.tpm.rds"))) {
  #   message("handled already, skipping...")
  #   next()
  # }
  query.exp.hg38 <- GDCquery(
    project = i,
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )

  GDCdownload(query.exp.hg38)
  #debug(GDCprepare)
  #debug(readTranscriptomeProfiling)
  #debug(colDataPrepare)
  expdat <- GDCprepare(
    query = query.exp.hg38
  )
  gc()

  message("saving...")
  saveRDS(SummarizedExperiment::assays(expdat)[["unstranded"]], file = glue::glue("data/count/{i}.exp.count.rds"))
  saveRDS(SummarizedExperiment::assays(expdat)[["tpm_unstrand"]], file = glue::glue("data/tpm/{i}.exp.tpm.rds"))
  message("done")
  gc()
}

ShixiangWang avatar Jun 19 '24 07:06 ShixiangWang