
TCGAbiolinks: downloaded clinical data contains repeated patients, and the case count differs from getProjectSummary

Open wentgithub opened this issue 5 years ago • 10 comments

rm(list = ls())
library(TCGAbiolinks)
library(dplyr)
library(DT)
query <- GDCquery(project = "TCGA-OV",
                  data.category = "Clinical",
                  file.type = "xml")
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
write.table(clinical, 'b.txt', sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)

The resulting table contains three duplicated bcr_patient_barcode values:

TCGA-3P-A9WA TCGA-59-A5PD TCGA-5X-AA5U

The case count also looks wrong. Below is the project summary for TCGA-OV: it reports a clinical case_count of 608, but the clinical data obtained above has only 590 rows, even including the three repeated patients. So what is the real case count? Thanks a lot for your help.

image
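Both observations can be checked by counting total rows, distinct barcodes, and repeated barcodes. The sketch below is only illustrative: it uses a small made-up data frame in place of the real `clinical` table. The column name `bcr_patient_barcode` and the three barcodes come from the post above; the fourth barcode and the assumption that each duplicate appears exactly twice are hypothetical.

```r
# Toy stand-in for the data frame returned by GDCprepare_clinic();
# the rows are illustrative, not real TCGA-OV records.
clinical <- data.frame(
  bcr_patient_barcode = c("TCGA-3P-A9WA", "TCGA-3P-A9WA",
                          "TCGA-59-A5PD", "TCGA-59-A5PD",
                          "TCGA-5X-AA5U", "TCGA-5X-AA5U",
                          "TCGA-XX-0001"),  # hypothetical barcode
  stringsAsFactors = FALSE
)

barcodes <- clinical$bcr_patient_barcode

n_rows   <- nrow(clinical)                       # rows, duplicates included
n_unique <- length(unique(barcodes))             # distinct patients
dup_ids  <- unique(barcodes[duplicated(barcodes)])  # barcodes seen more than once

cat("rows:", n_rows, " distinct cases:", n_unique, "\n")
print(dup_ids)
```

On the real table, the distinct count rather than the row count is the number to compare against the project summary.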

wentgithub avatar Sep 01 '19 11:09 wentgithub

I'll check the count and the duplicated samples. I have not touched that function in a long time.

tiagochst avatar Sep 02 '19 14:09 tiagochst

thanks a lot, waiting for your reply

wentgithub avatar Sep 02 '19 14:09 wentgithub

It seems TCGA-OV has 608 cases, but only 587 of them have clinical data. The GDC data portal shows the same numbers:

Screen Shot 2019-09-03 at 10 41 05 AM Screen Shot 2019-09-03 at 10 47 48 AM

Example of case missing clinical data: Screen Shot 2019-09-03 at 10 55 10 AM

Code to check the samples missing: http://rpubs.com/tiagochst/TCGA-OV-cases
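For readers who cannot open the link, the general idea of such a check is a set difference between all case IDs in the project and the case IDs that appear in the clinical table. This is a minimal sketch with made-up IDs, not the code from the linked page; both vectors are hypothetical placeholders.

```r
# Hypothetical case IDs; in practice these would come from the project's
# case list and from the parsed clinical table.
all_cases      <- c("case-A", "case-B", "case-C", "case-D")
clinical_cases <- c("case-A", "case-C")

# Cases that exist in the project but have no clinical record
missing_clinical <- setdiff(all_cases, clinical_cases)
print(missing_clinical)
```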

tiagochst avatar Sep 03 '19 13:09 tiagochst

Thanks a lot. So what is the right way to get the clinical data? Should I run the following and then deduplicate the result myself?

query <- GDCquery(project = "TCGA-OV",
                  data.category = "Clinical",
                  file.type = "xml")
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")

wentgithub avatar Sep 03 '19 15:09 wentgithub

I suggest using the indexed data:

clinical.indexed <- GDCquery_clinic(project = "TCGA-OV", type = "clinical")
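If the XML-based table from GDCprepare_clinic is kept instead, repeated patients can be dropped with base R. This is a sketch on a toy data frame, assuming that keeping the first row per barcode is acceptable; the `record` column is hypothetical, and on real data it is worth checking whether the repeated rows actually differ before discarding any.

```r
# Toy stand-in for the clinical table; `record` is a hypothetical column.
clinical <- data.frame(
  bcr_patient_barcode = c("TCGA-3P-A9WA", "TCGA-3P-A9WA", "TCGA-59-A5PD"),
  record = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Keep only the first row seen for each barcode
clinical_dedup <- clinical[!duplicated(clinical$bcr_patient_barcode), ]
print(clinical_dedup)
```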

tiagochst avatar Sep 04 '19 19:09 tiagochst

Unfortunately, with the same code I do not get the same data as you:

image

After running

write.table(clinical.indexed, "ov_clinical", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)

the output has doubled row names, and most of the content is NA:

image

wentgithub avatar Sep 05 '19 01:09 wentgithub

Which version of TCGAbiolinks do you have installed? It seems to be an old one. Could you please update it from GitHub with:

withr::with_envvar(c(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true"), 
  remotes::install_github('BioinformaticsFMRP/TCGAbiolinks')
)
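One way to report the installed version is `utils::packageVersion()`. The sketch below guards against the package being absent so it runs either way; the variable name `installed_version` is only illustrative.

```r
# Report the installed TCGAbiolinks version, or NA if it is not installed.
# system.file() returns "" when the package cannot be found.
if (nzchar(system.file(package = "TCGAbiolinks"))) {
  installed_version <- as.character(utils::packageVersion("TCGAbiolinks"))
} else {
  installed_version <- NA_character_
}
print(installed_version)
```

Restarting R after the update and before checking avoids reading the version of a copy that is still loaded in the session.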

tiagochst avatar Sep 05 '19 12:09 tiagochst

I had installed it from Bioconductor before. After installing it as you suggested, the counts are now right, but the content is still not what I wanted: it lacks the tumor stage information.

image

Here is my code:

rm(list = ls())
library(TCGAbiolinks)
library(dplyr)
library(DT)
clinical.indexed <- GDCquery_clinic(project = "TCGA-OV", type = "clinical")

Where am I going wrong? Thanks a lot.

wentgithub avatar Sep 05 '19 12:09 wentgithub

The indexed data is parsed from the XML files. It seems there is a problem with the parsing. You can get that information in the Biotab or XML.

query <- GDCquery(project = "TCGA-OV", 
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", 
                  data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
clinical.BCRtab.all$clinical_patient_ov$tumor_grade
clinical.BCRtab.all$clinical_patient_ov$clinical_stage
Screen Shot 2019-09-05 at 9 56 10 AM
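Because GDCprepare returns a list of tables for Biotab data, it can help to search every table for columns whose names mention stage or grade. The sketch below uses a toy list in place of the real `clinical.BCRtab.all`; only `clinical_patient_ov`, `tumor_grade`, and `clinical_stage` come from the code above, and every other name and value is hypothetical.

```r
# Toy stand-in for the list returned by GDCprepare(); values are made up.
clinical.BCRtab.all <- list(
  clinical_patient_ov = data.frame(
    bcr_patient_barcode = c("TCGA-XX-0001", "TCGA-XX-0002"),
    tumor_grade         = c("G2", "G3"),
    clinical_stage      = c("Stage IIIC", "Stage IV"),
    stringsAsFactors    = FALSE
  ),
  clinical_drug_ov = data.frame(   # hypothetical table name
    bcr_patient_barcode = "TCGA-XX-0001",
    drug_name           = "hypothetical",
    stringsAsFactors    = FALSE
  )
)

# For each table, list the columns whose names mention "stage" or "grade"
hits <- lapply(clinical.BCRtab.all, function(tbl) {
  grep("stage|grade", names(tbl), value = TRUE)
})
hits <- hits[lengths(hits) > 0]  # drop tables with no matching column
print(hits)
```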

tiagochst avatar Sep 05 '19 12:09 tiagochst

I am sorry, but the problem is the same whether I install TCGAbiolinks from Bioconductor or with the method you supplied:

withr::with_envvar(c(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true"),
  remotes::install_github('BioinformaticsFMRP/TCGAbiolinks')
)

Running this code:

rm(list = ls())
library(TCGAbiolinks)
library(dplyr)
library(DT)

query <- GDCquery(project = "TCGA-OV",
                  data.category = "Clinical",
                  data.type = "Clinical Supplement",
                  data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)

reports the same error with both installations:

image

So I checked the function, which also shows no such argument:

image

Sorry for disturbing you so many times. I hope TCGAbiolinks will become an even more outstanding package for analysing TCGA data.

wentgithub avatar Sep 06 '19 03:09 wentgithub