Potential Bug: GDCprepare does not work for breast cancer data

Open fabianjkrueger opened this issue 1 year ago • 1 comments

Hello!

There seems to be an issue with preparing certain data sets for analysis. It's weird, since it works for some of the projects but not for others. One of the projects causing issues is breast cancer ("BRCA"). I queried and downloaded the data for the different projects in a script like the one shown below.

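# build the query and keep it in a variable so GDCprepare can use it later
mutationQueryBRCA <- GDCquery(project = "TCGA-BRCA",
                              data.category = "Simple Nucleotide Variation",
                              data.type = "Masked Somatic Mutation")

# download the queried files into the same directory that GDCprepare reads from
GDCdownload(mutationQueryBRCA, directory = dl_path)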

# this is the step that just won't work for breast cancer...
mutationDataBRCA <- GDCprepare(mutationQueryBRCA, # specify which query to use
                               save = TRUE, # save the output as a file
                               save.filename = file.path(prepared_path, "BRCA_SNVMSM.RData"),
                               directory = dl_path, # directory where downloaded files are stored
                               remove.files.prepared = FALSE)

All paths are stored in variables, so this is not the issue. This code works for almost all the other cancer types, for example colon adenocarcinoma (project "COAD").

This is the error message I get:

Error in `dplyr::bind_rows()`:
! Can't combine `..151$Tumor_Seq_Allele2` <character> and `..152$Tumor_Seq_Allele2` <logical>.
Backtrace:
 1. TCGAbiolinks::GDCprepare(...)
 2. TCGAbiolinks:::readSimpleNucleotideVariationMaf(files)
 3. purrr::map_dfr(...)
 4. dplyr::bind_rows(res, .id = .id)

To me, it looks like there is a problem with data types, but I don't know how to fix it.
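
To illustrate what the message means, here is a minimal sketch with two made-up tables standing in for the per-file MAF data. The table names, values, and the `as_chr` helper are hypothetical and not taken from TCGAbiolinks; the only point is that `dplyr::bind_rows()` refuses to stack a character column on a logical one, and that giving both columns the same type makes the bind succeed.

```r
library(dplyr)

# Toy stand-ins for two per-file MAF tables (values are made up).
maf_chr <- tibble(Hugo_Symbol = "TP53",   Tumor_Seq_Allele2 = "A")
maf_lgl <- tibble(Hugo_Symbol = "PIK3CA", Tumor_Seq_Allele2 = TRUE)

# Binding directly fails: character and logical have no common type.
direct <- tryCatch(bind_rows(maf_chr, maf_lgl), error = function(e) e)
inherits(direct, "error")

# Hypothetical helper: coerce every column to character before binding.
as_chr <- function(df) mutate(df, across(everything(), as.character))
combined <- bind_rows(lapply(list(maf_chr, maf_lgl), as_chr))
```

Coercing after the fact turns a logical `TRUE` into the string "TRUE", so this only shows the mechanics of the mismatch. If the column was mis-typed while reading, the cleaner fix is probably to read it as character in the first place.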

Is there anything else I might be missing? Are there temporary files that depend on loading a specific library for reading them? If not, there might be a bug.

Feb 19 '24 15:02 fabianjkrueger

I encountered a similar bug while preparing a query for TCGA-UCEC. It has to do with the TCGAbiolinks:::readSimpleNucleotideVariationMaf call, where an empty table leads to an incompatible column type. My workaround uses data.table::fread instead of readr:

query <- GDCquery(
    project = "TCGA-UCEC", data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation",
    data.format = "MAF"
)
GDCdownload(query)
# query_results <- GDCprepare(query) # this errors out
files <- file.path(
    "GDCdata",
    query$results[[1]]$project,
    gsub(" ", "_", query$results[[1]]$data_category),
    gsub(" ", "_", query$results[[1]]$data_type),
    gsub(" ", "_", query$results[[1]]$file_id), 
    gsub(" ", "_", query$results[[1]]$file_name)
)
# read every MAF file with data.table::fread and stack the tables row-wise
maf_data <- do.call(rbind, lapply(files, data.table::fread, header = TRUE, skip = "#", sep = "\t"))

TCGAbiolinks v2.32.0, readr v2.1.5, R version 4.4.1 (2024-06-14)
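
If staying with readr is preferred, a variant of the same idea might be to force every column to character while reading, so that all per-file tables share the same column types before they are combined. The sketch below is an assumption on my side and not how TCGAbiolinks does it internally: it writes two tiny made-up MAF-like files to a temporary directory, and `read_maf_chr` is a hypothetical helper name.

```r
library(readr)
library(dplyr)

# Two tiny MAF-like files in a temp directory (content is made up).
tmp <- tempfile("maf_sketch_")
dir.create(tmp)
writeLines(c("#version gdc-1.0.0",
             "Hugo_Symbol\tTumor_Seq_Allele2",
             "TP53\tA"),
           file.path(tmp, "a.maf"))
writeLines(c("#version gdc-1.0.0",
             "Hugo_Symbol\tTumor_Seq_Allele2",
             "PIK3CA\tT"),
           file.path(tmp, "b.maf"))
files <- file.path(tmp, c("a.maf", "b.maf"))

# Hypothetical helper: drop "#" comment lines and read all columns as character.
# Caveat: readr ignores everything after a "#", also inside a data line.
read_maf_chr <- function(path) {
  read_tsv(path, comment = "#",
           col_types = cols(.default = col_character()))
}

# With identical column types, bind_rows has nothing to complain about.
maf_data <- bind_rows(lapply(files, read_maf_chr))
```

For real downloads, `files` would be the vector of paths built from the query results above instead of the temporary files.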

Oct 03 '24 14:10 DzmitryGB