TCGAbiolinks

GDCprepare() does not work with the latest GDC update (v32.0) for RNA-Seq

Open g27182818 opened this issue 3 years ago • 16 comments

After the new GDC release on March 29, 2022, the GDCdownload() function still works, but the GDCprepare() function gives an error when the query is for RNA-Seq data. Here is the minimal code to reproduce the issue:

library('TCGAbiolinks')
project_name <- "TCGA-ACC"
# Defines the query to the GDC
query <- GDCquery(project = project_name,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")

# Download data using api
GDCdownload(query, method = "api")
# Read the downloaded data into a single SummarizedExperiment object
data <- GDCprepare(query,
                   summarizedExperiment = TRUE)

Which produces the following error:

> data <- GDCprepare(query)
|===========================================================================================================|100%                      Completed after 13 s 
Error in `stop_subscript()`:
! Can't subset columns that don't exist.
x Locations 2, 3, and 4 don't exist.
i There are only 1 column.
Run `rlang::last_error()` to see where the error occurred.
There were 50 or more warnings (use warnings() to see the first 50)

g27182818 avatar Mar 29 '22 23:03 g27182818

Have you solved this problem?

guohout avatar Mar 30 '22 13:03 guohout

As I understand it, the problem is that the STAR - Counts files now come with much more information, so the GDCprepare() function is unable to read the new format. As a workaround, I decided to open each downloaded file individually and append each needed column to a data frame. The code I'm using now is this:

library('TCGAbiolinks')
project_name <- "TCGA-ACC"

# Defines the query to the GDC
query <- GDCquery(project = project_name,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  experimental.strategy = "RNA-Seq",
                  workflow.type = "STAR - Counts")

# Get metadata matrix
metadata <- query[[1]][[1]]

# Download data using api
GDCdownload(query, method = "api")

# Get main directory where data is stored
main_dir <- file.path("GDCdata", project_name)
# Get file list of downloaded files
file_list <- list.files(main_dir, recursive = TRUE, full.names = TRUE)

# Read the first downloaded file to get the gene identifiers (first column)
test_tab <- read.table(file = file_list[1], sep = '\t', header = TRUE)
# Delete the first four rows, which don't contain useful information
test_tab <- test_tab[-c(1:4),]
# Data frames for the TPM and STAR count data, starting with the gene column
tpm_data_frame <- data.frame(test_tab[,1])
count_data_frame <- data.frame(test_tab[,1])

# Loop over the files, appending one column per file, to get the complete matrix
for (i in seq_along(file_list)) {
  # Read table
  test_tab <- read.table(file = file_list[i], sep = '\t', header = TRUE)
  # Delete the rows that are not useful
  test_tab <- test_tab[-c(1:4),]
  # Column bind of TPM (column 7) and counts (column 4) data
  tpm_data_frame <- cbind(tpm_data_frame, test_tab[,7])
  count_data_frame <- cbind(count_data_frame, test_tab[,4])
  # Print progress from 0 to 1
  print(i/length(file_list))
}

This works and gets the data but is much slower than the original GDCprepare() function.
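
A possible variation on this workaround, sketched here under some assumptions, reads each file once, collects the columns in a list, and binds them in a single step so that the result also carries sample names as column names. The helper name read_star_column is hypothetical (it is not part of TCGAbiolinks), and the sketch assumes each file has a leading "#" comment line, a gene_id column, and summary rows whose gene_id starts with "N_".

```r
# Hypothetical helper (not part of TCGAbiolinks): read one column from each
# STAR - Counts file and assemble a genes x samples matrix.
read_star_column <- function(files, column, sample_ids = basename(files)) {
  stopifnot(length(files) == length(sample_ids))
  columns <- lapply(files, function(f) {
    # Assumption: lines starting with "#" are comments, fields are tab-separated
    tab <- read.delim(f, comment.char = "#", quote = "",
                      stringsAsFactors = FALSE)
    # Drop the summary rows by name instead of by position
    tab <- tab[!startsWith(tab$gene_id, "N_"), ]
    setNames(tab[[column]], tab$gene_id)
  })
  # All files must list the same genes in the same order
  ids <- names(columns[[1]])
  same_order <- vapply(columns, function(x) identical(names(x), ids), logical(1))
  stopifnot(all(same_order))
  # Bind all columns at once instead of growing a data frame in a loop
  mat <- do.call(cbind, columns)
  colnames(mat) <- sample_ids
  mat
}
```

With the variables from the code above, this could be called as read_star_column(file_list, "unstranded") for counts or read_star_column(file_list, "tpm_unstranded") for TPM. By default the columns are named after the files; if the query metadata has file_name and cases columns (an assumption worth checking on your own query), something like metadata$cases[match(basename(file_list), metadata$file_name)] could be passed as sample_ids instead.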

g27182818 avatar Mar 30 '22 15:03 g27182818

Thanks for your approach

guohout avatar Mar 30 '22 15:03 guohout

I also had a similar issue, but it is now fixed with the update to 2.23.6, i.e. BiocManager::install("BioinformaticsFMRP/TCGAbiolinks"). Thanks for the workaround @g27182818 and the quick update @tiagochst!

t-carroll avatar Mar 30 '22 22:03 t-carroll

Just in case someone has the same problem as me, BiocManager::install("BioinformaticsFMRP/TCGAbiolinks") was showing the following error:

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace 'TCGAbiolinksGUI.data' 1.14.0 is being loaded, but >= 1.15.1 is required
Calls: <Anonymous> ... withCallingHandlers -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted
ERROR: lazy loading failed for package 'TCGAbiolinks'
* removing 'C:/Users/Usuario/OneDrive/Documentos/R/win-library/4.1/TCGAbiolinks'
* restoring previous 'C:/Users/Usuario/OneDrive/Documentos/R/win-library/4.1/TCGAbiolinks'
Installation paths not writeable, unable to update packages
  path: C:/Program Files/R/R-4.1.2/library
  packages:
    class, cluster, foreign, MASS, Matrix, mgcv, nlme, nnet, rpart, spatial, survival
Warning message:
In i.p(...) :
  installation of package ‘C:/Users/Usuario/AppData/Local/Temp/RtmpKWbI9z/filec681ec526c7/TCGAbiolinks_2.23.7.tar.gz’ had non-zero exit status

This happened because the TCGAbiolinksGUI.data package also had to be installed directly from GitHub. So, the final way to access the new GDCprepare() function is:

BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

This will first update TCGAbiolinksGUI.data to the latest version (1.15.1) and then install the fixed version of TCGAbiolinks.
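
To confirm that both installs took effect, the installed versions can be compared against the minimums mentioned in this thread (1.15.1 for TCGAbiolinksGUI.data, 2.23.6 for TCGAbiolinks). This is only a small sketch; the helper has_min_version is hypothetical and not part of TCGAbiolinks or BiocManager.

```r
# Hypothetical helper: TRUE if `pkg` is installed at `min_version` or newer.
# packageVersion() reads the installed DESCRIPTION without loading the package.
has_min_version <- function(pkg, min_version) {
  installed <- tryCatch(utils::packageVersion(pkg), error = function(e) NULL)
  !is.null(installed) && installed >= min_version
}

# Minimum versions taken from the messages in this thread
has_min_version("TCGAbiolinksGUI.data", "1.15.1")
has_min_version("TCGAbiolinks", "2.23.6")
```

If either call returns FALSE, the corresponding install step probably did not complete, which may be worth checking after restarting R.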

g27182818 avatar Mar 31 '22 16:03 g27182818

Yes, I am still updating the package. It might be stable in the next few days. I updated the gene information to use GENCODE v36, which GDC now uses. That is why I need to update TCGAbiolinksGUI.data.

tiagochst avatar Mar 31 '22 17:03 tiagochst

in my case,

BiocManager::install("BioinformaticsFMRP/TCGAbiolinksGUI.data")
BiocManager::install("ExperimentHub")

Restart R

BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")

works

hyjforesight avatar Apr 04 '22 05:04 hyjforesight

I've tried all week! @hyjforesight saved me

aysenuroner avatar Apr 06 '22 22:04 aysenuroner

this is a good step, but I think sample names are missing in the matrix

snijesh avatar Apr 11 '22 05:04 snijesh

It takes a very long time after the progress bar reaches 100%. My console is still busy; is that normal? Should a notice be added for this case?

> library(TCGAbiolinks)
> proj <- "TCGA-STAD"
> query <- GDCquery(
+   project = proj,
+   data.category = "Transcriptome Profiling",
+   data.type = "Gene Expression Quantification",
+   workflow.type = "STAR - Counts"
+ )
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-STAD
--------------------
oo Filtering results
--------------------
ooo By data.type
ooo By workflow.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
> GDCdownload(query)
Downloading data for project TCGA-STAD
Of the 407 files for download 407 already exist.
All samples have been already downloaded
> data <- GDCprepare(query)
|==================================================================================================================|100%                      Completed after 50 s 

ShixiangWang avatar Apr 11 '22 07:04 ShixiangWang

I just found that the code below significantly slows the process.

https://github.com/BioinformaticsFMRP/TCGAbiolinks/blob/6cd187eb10b27b260e16c7cb25216fdef919d43d/R/prepare.R#L1448-L1451

Using data.table instead speeds it up:

df <- data.table::rbindlist(x, use.names = TRUE, idcol = "case_barcode")
data.table::dcast(df, gene_id + gene_name + gene_type ~ case_barcode, value.var = colnames(df)[-c(1:4)])
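
To make the suggestion concrete, here is a toy illustration of the same rbindlist/dcast pattern. The list x, the sample names S1 and S2, and the values are made up for the example; in the package, x would be the list of per-sample tables.

```r
library(data.table)

# Toy stand-in for the per-sample tables (names and values are illustrative)
x <- list(
  S1 = data.table(gene_id = c("G1", "G2"), gene_name = c("A", "B"),
                  gene_type = "protein_coding",
                  unstranded = c(5L, 0L), tpm_unstranded = c(1.5, 0)),
  S2 = data.table(gene_id = c("G1", "G2"), gene_name = c("A", "B"),
                  gene_type = "protein_coding",
                  unstranded = c(7L, 3L), tpm_unstranded = c(2.0, 0.5))
)

# Stack the tables; the list names become the "case_barcode" column
df <- rbindlist(x, use.names = TRUE, idcol = "case_barcode")

# Every column other than the identifier columns is treated as a value column
value_cols <- setdiff(colnames(df),
                      c("case_barcode", "gene_id", "gene_name", "gene_type"))

# One row per gene, one column per value/sample pair (e.g. unstranded_S1)
wide <- dcast(df, gene_id + gene_name + gene_type ~ case_barcode,
              value.var = value_cols)
print(wide)
```

Selecting the value columns by name rather than by position (-c(1:4)) is a small design choice that keeps the sketch working if the column order changes.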

ShixiangWang avatar Apr 11 '22 08:04 ShixiangWang

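Could you please update the tutorials (https://bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/download_prepare.html#Search_and_download_data_from_legacy_database_using_GDC_api_method) accordingly?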

Thanks.

sciencepeak avatar Apr 15 '22 18:04 sciencepeak

They are being updated in the devel version on Bioconductor.

https://bioconductor.org/packages/3.15/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

You also need to update the package with the GitHub version.

tiagochst avatar Apr 15 '22 18:04 tiagochst

I also ran into this problem!

PearlLiu-Dev avatar Apr 27 '22 10:04 PearlLiu-Dev

I ran into a similar problem with SNP data. However, it is not an error; it is a large number of warnings.

The code:

query_snp <- GDCquery( project = paste0("TCGA-", cancerType), data.category = "Simple Nucleotide Variation", data.type = "Masked Somatic Mutation", access = "open" )

GDCdownload(query=query_snp, method = "api", directory = DataDir)

maf <- GDCprepare(query = query_snp, directory = DataDir, save = TRUE, save.filename = "SNP_COAD_data.rda")

There were 50 or more warnings (use warnings() to see the first 50)
warnings()
Warning messages:
1: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
2: One or more parsing issues, see problems() for details
3: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
4: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
5: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
6: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
7: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
8: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
9: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
10: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
11: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
12: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
13: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
14: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
15: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
16: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
17: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED
18: The following named parsers don't match the column names: ALLELE_NUM, MINIMISED

sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950
[2] LC_CTYPE=Chinese (Traditional)_Taiwan.950
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.950

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] MoonlightR_1.20.0 doParallel_1.0.17
[3] iterators_1.0.14 foreach_1.5.2
[5] SummarizedExperiment_1.24.0 Biobase_2.54.0
[7] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
[9] IRanges_2.28.0 S4Vectors_0.32.4
[11] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
[13] matrixStats_0.62.0 TCGAbiolinks_2.25.0

loaded via a namespace (and not attached):
[1] shadowtext_0.1.2 circlize_0.4.14
[3] fastmatch_1.1-3 BiocFileCache_2.2.1
[5] plyr_1.8.7 igraph_1.3.1
[7] lazyeval_0.2.2 splines_4.1.3
[9] BiocParallel_1.28.3 ggplot2_3.3.6
[11] digest_0.6.29 yulab.utils_0.0.4
[13] htmltools_0.5.2 GOSemSim_2.20.0
[15] viridis_0.6.2 GO.db_3.14.0
[17] fansi_1.0.3 magrittr_2.0.3
[19] memoise_2.0.1 tzdb_0.3.0
[21] limma_3.50.3 Biostrings_2.62.0
[23] readr_2.1.2 graphlayouts_0.8.0
[25] vroom_1.5.7 R.utils_2.11.0
[27] enrichplot_1.14.2 prettyunits_1.1.1
[29] jpeg_0.1-9 colorspace_2.0-3
[31] blob_1.2.3 rvest_1.0.2
[33] rappdirs_0.3.3 ggrepel_0.9.1
[35] xfun_0.30 dplyr_1.0.9
[37] tcltk_4.1.3 crayon_1.5.1
[39] RCurl_1.98-1.6 jsonlite_1.8.0
[41] scatterpie_0.1.7 GEOquery_2.62.2
[43] ape_5.6-2 glue_1.6.2
[45] polyclip_1.10-0 gtable_0.3.0
[47] zlibbioc_1.40.0 XVector_0.34.0
[49] DelayedArray_0.20.0 shape_1.4.6
[51] scales_1.2.0 DOSE_3.20.1
[53] HiveR_0.3.63 DBI_1.1.2
[55] Rcpp_1.0.8.3 viridisLite_0.4.0
[57] progress_1.2.2 gridGraphics_0.5-1
[59] tidytree_0.3.9 bit_4.0.4
[61] htmlwidgets_1.5.4 httr_1.4.3
[63] fgsea_1.20.0 gplots_3.1.3
[65] RColorBrewer_1.1-3 ellipsis_0.3.2
[67] R.methodsS3_1.8.1 pkgconfig_2.0.3
[69] XML_3.99-0.9 farver_2.1.0
[71] dbplyr_2.1.1 utf8_1.2.2
[73] RISmed_2.3.0 ggplotify_0.1.0
[75] tidyselect_1.1.2 rlang_1.0.2
[77] reshape2_1.4.4 AnnotationDbi_1.56.2
[79] munsell_0.5.0 tools_4.1.3
[81] cachem_1.0.6 downloader_0.4
[83] cli_3.3.0 generics_0.1.2
[85] RSQLite_2.2.13 stringr_1.4.0
[87] fastmap_1.1.0 ggtree_3.2.1
[89] knitr_1.39 bit64_4.0.5
[91] tidygraph_1.2.1 caTools_1.18.2
[93] rgl_0.108.3 randomForest_4.7-1
[95] purrr_0.3.4 KEGGREST_1.34.0
[97] ggraph_2.0.5 nlme_3.1-157
[99] R.oo_1.24.0 aplot_0.1.4
[101] DO.db_2.9 xml2_1.3.3
[103] biomaRt_2.50.3 compiler_4.1.3
[105] filelock_1.0.2 curl_4.3.2
[107] png_0.1-7 treeio_1.18.1
[109] tibble_3.1.7 tweenr_1.0.2
[111] stringi_1.7.6 TCGAbiolinksGUI.data_1.15.1
[113] lattice_0.20-45 Matrix_1.4-1
[115] vctrs_0.4.1 pillar_1.7.0
[117] lifecycle_1.0.1 GlobalOptions_0.1.2
[119] parmigene_1.1.0 data.table_1.14.2
[121] bitops_1.0-7 patchwork_1.1.1
[123] qvalue_2.26.0 R6_2.5.1
[125] KernSmooth_2.23-20 gridExtra_2.3
[127] codetools_0.2-18 gtools_3.9.2
[129] MASS_7.3-57 assertthat_0.2.1
[131] withr_2.5.0 GenomeInfoDbData_1.2.7
[133] hms_1.1.1 clusterProfiler_4.2.2
[135] grid_4.1.3 ggfun_0.0.6
[137] tidyr_1.2.0 ggforce_0.3.3

git-jrwang avatar May 06 '22 05:05 git-jrwang

@g27182818, hi, are you using R version 4.1.2 or 4.2.0?

qiz218591 avatar May 20 '22 10:05 qiz218591