
New GDC readout for harmonized TCGA data: 'Allele-specific Copy Number Segment'

Open schelhorn opened this issue 4 years ago • 2 comments

Hi there,

Since GDC release 23 there is a new readout for TCGA, 'Allele-specific Copy Number Segment'. It would be great if it could be downloaded in TCGAbiolinks via:

query.cnv <- TCGAbiolinks::GDCquery(project = 'TCGA-ACC',
                                    data.category = 'Copy Number Variation',
                                    data.type = 'Allele-specific Copy Number Segment',
                                    legacy = FALSE)
TCGAbiolinks::GDCdownload(query.cnv, directory = './')
tcga.cnv <- TCGAbiolinks::GDCprepare(query.cnv, directory = './')

or similar. For internal testing I overwrote TCGAbiolinks::checkDataTypeInput to allow the new readout; with that change, querying and downloading the files via the GDC API already works (i.e., the query.cnv data frame is populated with the correct GDC file IDs and the files are downloaded). However, I still get parsing errors in TCGAbiolinks::GDCprepare. Anyway, I hope the extension is straightforward.

schelhorn, Apr 14 '20 08:04

Update: in the meantime, this dirty patch allows reading in the allele-specific files as described above (no guarantees, though ;):


checkDataTypeInput2 <- function(legacy, data.type){
    if(legacy){
        legacy.data.type <- c("Copy number segmentation",
                              "Raw intensities",
                              "Aligned reads",
                              "Copy number estimate",
                              "Simple nucleotide variation",
                              "Gene expression quantification",
                              "Coverage WIG",
                              "miRNA gene quantification",
                              "Genotypes",
                              "miRNA isoform quantification",
                              "Normalized copy numbers",
                              "Isoform expression quantification",
                              "Normalized intensities",
                              "Tissue slide image",
                              "Exon quantification",
                              "Exon junction quantification",
                              "Methylation beta value",
                              "Unaligned reads",
                              "Diagnostic image",
                              "CGH array QC",
                              "Biospecimen Supplement",
                              "Pathology report",
                              "Clinical Supplement",
                              "Intensities",
                              "Protein expression quantification",
                              "Microsatellite instability",
                              "Structural variation",
                              "Auxiliary test",
                              "Copy number QC metrics",
                              "Intensities Log2Ratio",
                              "Methylation array QC metrics",
                              "Clinical data",
                              "Copy number variation",
                              "ABI sequence trace",
                              "Biospecimen data",
                              "Simple somatic mutation",
                              "Bisulfite sequence alignment",
                              "Methylation percentage",
                              "Sequencing tag",
                              "Sequencing tag counts",
                              "LOH")
        if(!data.type %in% legacy.data.type) {
            print(knitr::kable(as.data.frame(sort(legacy.data.type))))
            stop("Please set a data.type argument from the column legacy.data.type above")
        }
    } else {
        harmonized.data.type <- c(
            "Aggregated Somatic Mutation",
            "Gene Expression Quantification",
            "Raw CGI Variant",
            "Methylation Beta Value",
            "Splice Junction Quantification",
            "Annotated Somatic Mutation",
            "Raw Simple Somatic Mutation",
            "Masked Somatic Mutation",
            "Copy Number Segment",
            "Allele-specific Copy Number Segment",
            "Masked Copy Number Segment",
            "Isoform Expression Quantification",
            "miRNA Expression Quantification",
            "Biospecimen Supplement",
            "Gene Level Copy Number Scores",
            "Clinical Supplement",
            "Slide Image")
        if(!data.type %in% harmonized.data.type) {
            print(knitr::kable(as.data.frame(sort(harmonized.data.type))))
            stop("Please set a data.type argument from the column harmonized.data.type above")
        }
    }
}

# Reads copy number variation files into a single data frame (row-binds them).
# ASCAT2 (allele-specific) files have seven columns; other segment files have six.
readCopyNumberVariation2 <- function(files, cases){
  message("Reading copy number variation files")
  pb <- utils::txtProgressBar(min = 0, max = length(files), style = 3)
  segments <- vector("list", length(files))
  for (i in seq_along(files)) {
    # Pick the column specification from the file name
    col.types <- if (grepl("ascat2", files[i])) "ccnnnnn" else "ccnnnd"
    # Namespace-qualified so the patch does not depend on readr being attached
    data <- readr::read_tsv(file = files[i], col_names = TRUE, col_types = col.types)
    if(!missing(cases)) data$Sample <- cases[i]
    segments[[i]] <- data
    utils::setTxtProgressBar(pb, i)
  }
  close(pb)
  # Bind once at the end instead of growing the data frame inside the loop
  df <- do.call(rbind, segments)
  return(df)
}

assignInNamespace("checkDataTypeInput", checkDataTypeInput2, ns="TCGAbiolinks")
assignInNamespace("readCopyNumberVariation", readCopyNumberVariation2, ns="TCGAbiolinks")
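
The reason the reader needs two column specifications can be illustrated with a tiny made-up file: the allele-specific (ASCAT2) segment files carry seven columns, one more than the six-column segment files handled before. This is only a sketch. The column names and values below are assumptions for the example, not taken from a real GDC download, and the snippet needs nothing but readr:

```r
library(readr)

# Hypothetical two-row file mimicking an allele-specific (ASCAT2) segment file.
# Column names and values are assumed for illustration only.
ascat.file <- tempfile(pattern = "example.ascat2.", fileext = ".seg.txt")
writeLines(c(
  paste("GDC_Aliquot", "Chromosome", "Start", "End",
        "Copy_Number", "Major_Copy_Number", "Minor_Copy_Number", sep = "\t"),
  paste("aliquot-1", "chr1", "10000", "250000", "3", "2", "1", sep = "\t"),
  paste("aliquot-1", "chrX", "250001", "900000", "2", "1", "1", sep = "\t")
), ascat.file)

# Same selection rule as in the patch: the file name decides the column spec.
col.types <- if (grepl("ascat2", ascat.file)) "ccnnnnn" else "ccnnnd"
segments <- read_tsv(ascat.file, col_names = TRUE, col_types = col.types)

# Two character columns followed by five numeric ones
print(ncol(segments))
print(sapply(segments, class))
```

Reading such a file with the six-column specification "ccnnnd" would not match its seven columns, which is presumably where the errors in GDCprepare came from.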

schelhorn, Apr 14 '20 08:04

@schelhorn Thanks!

tiagochst, May 01 '20 15:05