TCGAbiolinks GDCprepare() does not work with last update of GDC v32.0 for RNA-Seq

After the new GDC release on March 29, 2022, the GDCdownload() function still works, but the GDCprepare() function gives an error when the query is for RNA-Seq data. Here is the minimal code to reproduce the issue:

library('TCGAbiolinks')
project_name <- "TCGA-ACC"
# Defines the query to the GDC
query <- GDCquery(project = project_name,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")

# Download data using api
GDCdownload(query, method = "api")
# Read downloaded data and get a single summarized experiment object
data <- GDCprepare(query,
                   summarizedExperiment = TRUE)

Which produces the following error:

> data <- GDCprepare(query)
|===========================================================================================================|100%                      Completed after 13 s 
Error in `stop_subscript()`:
! Can't subset columns that don't exist.
x Locations 2, 3, and 4 don't exist.
i There are only 1 column.
Run `rlang::last_error()` to see where the error occurred.
There were 50 or more warnings (use warnings() to see the first 50)

Mar 29 '22 23:03 g27182818

Have you solved this problem?

Mar 30 '22 13:03 guohout

As I understand it, the problem is that the STAR - Counts files now come with much more information, so the GDCprepare() function is unable to read the new format. As a workaround, I decided to open each downloaded file individually and append the columns I need to a data frame. The code I'm using now is this:

library('TCGAbiolinks')
project_name <- "TCGA-ACC"

# Defines the query to the GDC
query <- GDCquery(project = project_name,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")

# Get metadata matrix
metadata <- query[[1]][[1]]

# Download data using api
GDCdownload(query, method = "api")

# Get main directory where data is stored
main_dir <- file.path("GDCdata", project_name)
# Get file list of downloaded files
file_list <- file.path("GDCdata", project_name,list.files(main_dir,recursive = TRUE)) 

# Read the first downloaded file to get gene names
test_tab <- read.table(file = file_list[1], sep = '\t', header = TRUE)
# Delete the leading summary rows that don't contain useful information
test_tab <- test_tab[-c(1:4),]
# STAR counts and tpm datasets
tpm_data_frame <- data.frame(test_tab[,1])
count_data_frame <- data.frame(test_tab[,1])

# Append cycle to get the complete matrix
for (i in c(1:length(file_list))) {
  # Read table
  test_tab <- read.table(file = file_list[i], sep = '\t', header = TRUE)
  # Delete not useful lines
  test_tab <- test_tab[-c(1:4),]
  # Column bind of tpm and counts data
  tpm_data_frame <- cbind(tpm_data_frame, test_tab[,7])
  count_data_frame <- cbind(count_data_frame, test_tab[,4])
  # Print progress from 0 to 1
  print(i/length(file_list))
}

This works and gets the data but is much slower than the original GDCprepare() function.
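
Part of the slowdown probably comes from growing the data frames with cbind() inside the loop and from reading the first file twice. Here is a minimal sketch of an alternative that reads each file once and binds the columns in a single step. It assumes every file lists the genes in the same order and has the layout used above (four summary rows after the header, counts in column 4, TPM in column 7). The helper names read_star_file, build_matrix and make_toy_file are hypothetical, and the temporary files only imitate that layout for demonstration:

```r
# Hypothetical helper: read one count file and drop the four summary rows
read_star_file <- function(path) {
  tab <- read.table(file = path, sep = "\t", header = TRUE,
                    stringsAsFactors = FALSE)
  tab[-c(1:4), ]
}

# Hypothetical helper: take one column from every table and bind them once
build_matrix <- function(tables, column) {
  mat <- do.call(cbind, lapply(tables, function(tab) tab[[column]]))
  rownames(mat) <- tables[[1]][[1]]  # gene ids from the first file
  mat
}

# Toy file imitating the layout: comment line, header, 4 summary rows, 2 genes
make_toy_file <- function(offset) {
  path <- tempfile(fileext = ".tsv")
  writeLines(c(
    "# toy file for illustration only",
    paste("gene_id", "gene_name", "gene_type", "unstranded",
          "stranded_first", "stranded_second", "tpm_unstranded", sep = "\t"),
    paste(c("N_unmapped", "N_multimapping", "N_noFeature", "N_ambiguous"),
          "", "", 1, 1, 1, "", sep = "\t"),
    paste("ENSG1", "A", "protein_coding", 10 + offset, 0, 0, 1.5 + offset, sep = "\t"),
    paste("ENSG2", "B", "protein_coding", 20 + offset, 0, 0, 2.5 + offset, sep = "\t")
  ), path)
  path
}

files <- c(make_toy_file(0), make_toy_file(100))
tables <- lapply(files, read_star_file)  # each file is read exactly once
counts <- build_matrix(tables, 4)        # column 4: unstranded counts
tpm <- build_matrix(tables, 7)           # column 7: TPM values
```

With real data, files would be replaced by file_list from the code above. The resulting matrices have one row per gene and one unnamed column per file, in the order of the file list.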

Mar 30 '22 15:03 g27182818

Thanks for your approach

Mar 30 '22 15:03 guohout

I also had a similar issue, but it is now fixed with the update to 2.23.6, i.e. BiocManager::install("BioinformaticsFMRP/TCGAbiolinks"). Thanks for the workaround @g27182818 and the quick update @tiagochst!

Mar 30 '22 22:03 t-carroll

Just in case someone has the same problem as me, BiocManager::install("BioinformaticsFMRP/TCGAbiolinks") was showing the following error:

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace 'TCGAbiolinksGUI.data' 1.14.0 is being loaded, but >= 1.15.1 is required
Calls: <Anonymous> ... withCallingHandlers -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted
ERROR: lazy loading failed for package 'TCGAbiolinks'
* removing 'C:/Users/Usuario/OneDrive/Documentos/R/win-library/4.1/TCGAbiolinks'
* restoring previous 'C:/Users/Usuario/OneDrive/Documentos/R/win-library/4.1/TCGAbiolinks'
Installation paths not writeable, unable to update packages
  path: C:/Program Files/R/R-4.1.2/library
  packages:
    class, cluster, foreign, MASS, Matrix, mgcv, nlme, nnet, rpart, spatial, survival
Warning message:
In i.p(...) :
  installation of package ‘C:/Users/Usuario/AppData/Local/Temp/RtmpKWbI9z/filec681ec526c7/TCGAbiolinks_2.23.7.tar.gz’ had non-zero exit status

This happened because the TCGAbiolinksGUI.data package also had to be installed directly from GitHub. So, the final way to access the new GDCprepare() function is:

BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

This will first update TCGAbiolinksGUI.data to the latest version, 1.15.1, and then install the fixed version of TCGAbiolinks.
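
Before reinstalling, it may help to check whether an installed version already satisfies the requirement named in the error message. A minimal sketch using base R only; the helper name meets_requirement is hypothetical, and the version strings are the ones quoted above:

```r
# Hypothetical helper: TRUE when the installed version is at least the required one
meets_requirement <- function(installed, required) {
  package_version(installed) >= package_version(required)
}

# Versions taken from the error message above
old_ok <- meets_requirement("1.14.0", "1.15.1")
new_ok <- meets_requirement("1.15.1", "1.15.1")

# With the package installed, the real version could be checked like this:
# meets_requirement(as.character(packageVersion("TCGAbiolinksGUI.data")), "1.15.1")
```

Here old_ok is FALSE and new_ok is TRUE, which matches the failure seen with 1.14.0 and the successful install after updating.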

Mar 31 '22 16:03 g27182818

Yes, I am still updating the package. It should be stable in the next few days. I updated the gene information to use GENCODE v36, which GDC now uses. That is why I needed to update TCGAbiolinksGUI.data.

Mar 31 '22 17:03 tiagochst

In my case, this works:

BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("ExperimentHub")

Restart R, then run:

BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

Apr 04 '22 05:04 hyjforesight

I've tried all week! @hyjforesight saved me

Apr 06 '22 22:04 aysenuroner

This is a good step, but I think the sample names are missing from the matrix.
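
One possible way to label the columns, sketched under assumptions: the query result table (metadata in the workaround above) is assumed to have a file_name column and a cases column holding the sample barcodes. Those column names are not confirmed anywhere in this thread, the toy data below only stands in for the real tables, and the helper name barcodes_for_files is hypothetical:

```r
# Hypothetical helper: look up the barcode of each downloaded file by file name
barcodes_for_files <- function(file_list, metadata) {
  idx <- match(basename(file_list), metadata$file_name)
  if (anyNA(idx)) stop("Some files are not listed in the metadata")
  metadata$cases[idx]
}

# Toy stand-ins (assumed column names: file_name, cases; barcodes are made up)
metadata <- data.frame(
  file_name = c("b.tsv", "a.tsv"),
  cases = c("TCGA-XX-0002-01A", "TCGA-XX-0001-01A"),
  stringsAsFactors = FALSE
)
file_list <- file.path("GDCdata", "TCGA-ACC", c("dir1/a.tsv", "dir2/b.tsv"))
count_data_frame <- data.frame(gene_id = c("g1", "g2"), c(1, 2), c(3, 4))

# First column holds the gene ids; the others follow the order of file_list
colnames(count_data_frame) <- c("gene_id",
                                barcodes_for_files(file_list, metadata))
```

Matching on the file name rather than relying on row order keeps the labels correct even when the metadata and the file list are sorted differently.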

Apr 11 '22 05:04 snijesh

It takes a very long time after the progress bar reaches 100%. My console is still busy; is that normal? Should a note be added for such cases?

> library(TCGAbiolinks)
> proj <- "TCGA-STAD"
> query <- GDCquery(
+   project = proj,
+   data.category = "Transcriptome Profiling",
+   data.type = "Gene Expression Quantification",
+   workflow.type = "STAR - Counts"
+ )
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-STAD
--------------------
oo Filtering results
--------------------
ooo By data.type
ooo By workflow.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
> GDCdownload(query)
Downloading data for project TCGA-STAD
Of the 407 files for download 407 already exist.
All samples have been already downloaded
> data <- GDCprepare(query)
|==================================================================================================================|100%                      Completed after 50 s

Apr 11 '22 07:04 ShixiangWang

I just found that the code below significantly slows down the process:

https://github.com/BioinformaticsFMRP/TCGAbiolinks/blob/6cd187eb10b27b260e16c7cb25216fdef919d43d/R/prepare.R#L1448-L1451

Using data.table instead will speed it up:

df <- data.table::rbindlist(x, use.names = TRUE, idcol = "case_barcode")
data.table::dcast(df, gene_id + gene_name + gene_type ~ case_barcode, value.var = colnames(df)[-c(1:4)])
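
To illustrate the idea on toy data, here is a sketch that assumes x is a named list with one table per sample (the names are the barcodes), each table having gene_id, gene_name and gene_type followed by the value columns. The toy tables, the sample names and the helper make_sample are made up; only rbindlist() and dcast() from data.table are used:

```r
library(data.table)

# Toy input: one table per sample, named by a made-up sample label
make_sample <- function(shift) {
  data.table(
    gene_id = c("ENSG1", "ENSG2"),
    gene_name = c("A", "B"),
    gene_type = "protein_coding",
    unstranded = c(10L, 20L) + shift,
    tpm_unstranded = c(1.5, 2.5) + shift
  )
}
x <- list(sampleA = make_sample(0L), sampleB = make_sample(100L))

# Stack all samples; the list names end up in the case_barcode column
df <- rbindlist(x, use.names = TRUE, idcol = "case_barcode")

# One row per gene, one column per value/sample combination
wide <- dcast(df, gene_id + gene_name + gene_type ~ case_barcode,
              value.var = colnames(df)[-c(1:4)])
```

Because several value columns are cast at once, the output columns are named by combining the value column and the sample, for example unstranded_sampleA.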

Apr 11 '22 08:04 ShixiangWang

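Could you please update the tutorials (https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/download_prepare.html#Search_and_download_data_from_legacy_database_using_GDC_api_method) accordingly?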

Thanks.

Apr 15 '22 18:04 sciencepeak

They are being updated in the devel version on Bioconductor.

https://bioconductor.org/packages/3.15/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

You also need to update the package with the GitHub version.

Apr 15 '22 18:04 tiagochst

I ran into this problem too!

Apr 27 '22 10:04 PearlLiu-Dev

I ran into a similar problem with SNP data. However, it is not an error but a warning, repeated many times.

The code:

query_snp <- GDCquery(project = paste0("TCGA-", cancerType),
                      data.category = "Simple Nucleotide Variation",
                      data.type = "Masked Somatic Mutation",
                      access = "open")

GDCdownload(query = query_snp, method = "api", directory = DataDir)

maf <- GDCprepare(query = query_snp,
                  directory = DataDir,
                  save = TRUE,
                  save.filename = "SNP_COAD_data.rda")

There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
2: One or more parsing issues, see problems() for details
3: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
4: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
5: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
6: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
7: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
8: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
9: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
10: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
11: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
12: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
13: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
14: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
15: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
16: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
17: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
18: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED

sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950 [2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950 [4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.950

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] MoonlightR_1.20.0 doParallel_1.0.17
[3] iterators_1.0.14 foreach_1.5.2
[5] SummarizedExperiment_1.24.0 Biobase_2.54.0
[7] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
[9] IRanges_2.28.0 S4Vectors_0.32.4
[11] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
[13] matrixStats_0.62.0 TCGAbiolinks_2.25.0

loaded via a namespace (and not attached):
[1] shadowtext_0.1.2 circlize_0.4.14
[3] fastmatch_1.1-3 BiocFileCache_2.2.1
[5] plyr_1.8.7 igraph_1.3.1
[7] lazyeval_0.2.2 splines_4.1.3
[9] BiocParallel_1.28.3 ggplot2_3.3.6
[11] digest_0.6.29 yulab.utils_0.0.4
[13] htmltools_0.5.2 GOSemSim_2.20.0
[15] viridis_0.6.2 GO.db_3.14.0
[17] fansi_1.0.3 magrittr_2.0.3
[19] memoise_2.0.1 tzdb_0.3.0
[21] limma_3.50.3 Biostrings_2.62.0
[23] readr_2.1.2 graphlayouts_0.8.0
[25] vroom_1.5.7 R.utils_2.11.0
[27] enrichplot_1.14.2 prettyunits_1.1.1
[29] jpeg_0.1-9 colorspace_2.0-3
[31] blob_1.2.3 rvest_1.0.2
[33] rappdirs_0.3.3 ggrepel_0.9.1
[35] xfun_0.30 dplyr_1.0.9
[37] tcltk_4.1.3 crayon_1.5.1
[39] RCurl_1.98-1.6 jsonlite_1.8.0
[41] scatterpie_0.1.7 GEOquery_2.62.2
[43] ape_5.6-2 glue_1.6.2
[45] polyclip_1.10-0 gtable_0.3.0
[47] zlibbioc_1.40.0 XVector_0.34.0
[49] DelayedArray_0.20.0 shape_1.4.6
[51] scales_1.2.0 DOSE_3.20.1
[53] HiveR_0.3.63 DBI_1.1.2
[55] Rcpp_1.0.8.3 viridisLite_0.4.0
[57] progress_1.2.2 gridGraphics_0.5-1
[59] tidytree_0.3.9 bit_4.0.4
[61] htmlwidgets_1.5.4 httr_1.4.3
[63] fgsea_1.20.0 gplots_3.1.3
[65] RColorBrewer_1.1-3 ellipsis_0.3.2
[67] R.methodsS3_1.8.1 pkgconfig_2.0.3
[69] XML_3.99-0.9 farver_2.1.0
[71] dbplyr_2.1.1 utf8_1.2.2
[73] RISmed_2.3.0 ggplotify_0.1.0
[75] tidyselect_1.1.2 rlang_1.0.2
[77] reshape2_1.4.4 AnnotationDbi_1.56.2
[79] munsell_0.5.0 tools_4.1.3
[81] cachem_1.0.6 downloader_0.4
[83] cli_3.3.0 generics_0.1.2
[85] RSQLite_2.2.13 stringr_1.4.0
[87] fastmap_1.1.0 ggtree_3.2.1
[89] knitr_1.39 bit64_4.0.5
[91] tidygraph_1.2.1 caTools_1.18.2
[93] rgl_0.108.3 randomForest_4.7-1
[95] purrr_0.3.4 KEGGREST_1.34.0
[97] ggraph_2.0.5 nlme_3.1-157
[99] R.oo_1.24.0 aplot_0.1.4
[101] DO.db_2.9 xml2_1.3.3
[103] biomaRt_2.50.3 compiler_4.1.3
[105] filelock_1.0.2 curl_4.3.2
[107] png_0.1-7 treeio_1.18.1
[109] tibble_3.1.7 tweenr_1.0.2
[111] stringi_1.7.6 TCGAbiolinksGUI.data_1.15.1
[113] lattice_0.20-45 Matrix_1.4-1
[115] vctrs_0.4.1 pillar_1.7.0
[117] lifecycle_1.0.1 GlobalOptions_0.1.2
[119] parmigene_1.1.0 data.table_1.14.2
[121] bitops_1.0-7 patchwork_1.1.1
[123] qvalue_2.26.0 R6_2.5.1
[125] KernSmooth_2.23-20 gridExtra_2.3
[127] codetools_0.2-18 gtools_3.9.2
[129] MASS_7.3-57 assertthat_0.2.1
[131] withr_2.5.0 GenomeInfoDbData_1.2.7
[133] hms_1.1.1 clusterProfiler_4.2.2
[135] grid_4.1.3 ggfun_0.0.6
[137] tidyr_1.2.0 ggforce_0.3.3

May 06 '22 05:05 git-jrwang

@g27182818, hi, are you using R version 4.1.2 or 4.2.0?

May 20 '22 10:05 qiz218591
