TCGAbiolinks
TCGAbiolinks downloads duplicated clinical data, and the case count differs from getProjectSummary
rm(list=ls())
library(TCGAbiolinks)
library(dplyr)
library(DT)
query <- GDCquery(project = "TCGA-OV", data.category = "Clinical", file.type = "xml")
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
write.table(clinical, 'b.txt', sep="\t", row.names=F, col.names=T, quote=F)
The output contains three duplicated bcr_patient_barcode values:
TCGA-3P-A9WA TCGA-59-A5PD TCGA-5X-AA5U
Waiting for your help, thanks a lot.
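A minimal sketch of how the repeated barcodes could be located with base R. The data frame below is a toy stand-in for the output of GDCprepare_clinic(): its rows are illustrative only, and the last step assumes that repeated rows for a patient are interchangeable, which this thread does not confirm.

```r
# Toy stand-in for the data frame returned by GDCprepare_clinic();
# the rows are illustrative, not real TCGA-OV records.
clinical <- data.frame(
  bcr_patient_barcode = c("TCGA-3P-A9WA", "TCGA-3P-A9WA",
                          "TCGA-59-A5PD", "TCGA-5X-AA5U"),
  gender = c("FEMALE", "FEMALE", "FEMALE", "FEMALE"),
  stringsAsFactors = FALSE
)

# Barcodes that occur more than once
ids <- clinical$bcr_patient_barcode
dup_ids <- unique(ids[duplicated(ids)])

# All rows belonging to those barcodes, for side-by-side inspection
dup_rows <- clinical[ids %in% dup_ids, ]

# Number of rows versus number of distinct patients
n_rows <- nrow(clinical)
n_patients <- length(unique(ids))

# Keep the first row per barcode (assumes the repeats carry the same information)
clinical_unique <- clinical[!duplicated(ids), ]
```

Comparing n_rows with n_patients shows at a glance whether a table contains repeats, and dup_rows helps decide whether the repeated rows really are identical before any are dropped.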
============================================================================================================================
Here is the result for TCGA-OV: the clinical case_count you report is 608, but the clinical data obtained above has only 590 rows, even including the three repeated entries. So what is the real case count?
I'll check the count and the duplicated sample. I did not touch that function for a long time.
Thanks a lot, waiting for your reply.
It seems TCGA-OV has 608 cases, but only 587 have clinical data. The numbers are the same from GDC data portal as shown below:
[Image: Screen Shot 2019-09-03 at 10 41 05 AM]
[Image: Screen Shot 2019-09-03 at 10 47 48 AM]
Example of case missing clinical data:
Code to check the samples missing: http://rpubs.com/tiagochst/TCGA-OV-cases
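The linked notebook is not reproduced here. As a rough sketch of the idea, cases without clinical data can be found as a set difference between all case identifiers in the project and those present in the clinical table. Both vectors below are hypothetical placeholders, not real TCGA-OV identifiers.

```r
# Hypothetical case identifiers, for illustration only
all_cases <- c("CASE-01", "CASE-02", "CASE-03", "CASE-04")
clinical_cases <- c("CASE-01", "CASE-03")

# Cases in the project that have no clinical record
missing_clinical <- setdiff(all_cases, clinical_cases)

# Cases that do have a clinical record
with_clinical <- intersect(all_cases, clinical_cases)

# Sanity check: the two groups should add up to the project total
stopifnot(length(missing_clinical) + length(with_clinical) == length(all_cases))
```

With real data, the two lengths would correspond to the gap between the 608 cases and the 587 cases with clinical data mentioned above.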
Thanks a lot. So what is the right way to get the clinical data? Like this?
query <- GDCquery(project = "TCGA-OV", data.category = "Clinical", file.type = "xml")
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
# and then dedup myself?
I suggest using the indexed data:
clinical.indexed <- GDCquery_clinic(project = "TCGA-OV", type = "clinical")
That is a pity: using the same code, I do not get the same data as you.
After running
write.table(clinical.indexed, "ov_clinical", sep="\t", row.names=F, col.names=T, quote=F)
the result has double row names, and most of the content is NA.
Which version of TCGAbiolinks do you have installed? It seems it is an old one. Could you please update it from Github with:
withr::with_envvar(c(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true"),
remotes::install_github('BioinformaticsFMRP/TCGAbiolinks')
)
I had installed it from Bioconductor before.
After installing as you suggested, the counts are now right, but the content is still not what I wanted.
At the least, it lacks tumor stage information.
Here is my code:
rm(list=ls())
library(TCGAbiolinks)
library(dplyr)
library(DT)
clinical.indexed <- GDCquery_clinic(project = "TCGA-OV", type = "clinical")
Where am I going wrong? Thanks a lot.
The indexed data is parsed from the XML files. It seems there is a problem with the parsing. You can get that information in the Biotab or XML.
query <- GDCquery(project = "TCGA-OV",
data.category = "Clinical",
data.type = "Clinical Supplement",
data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
clinical.BCRtab.all$clinical_patient_ov$tumor_grade
query <- GDCquery(project = "TCGA-OV",
data.category = "Clinical",
data.type = "Clinical Supplement",
data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
clinical.BCRtab.all$clinical_patient_ov$clinical_stage
[Image: Screen Shot 2019-09-05 at 9 56 10 AM]
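The calls above index into the result of GDCprepare() as a named list of tables. A small sketch of how such a list could be explored to find stage and grade columns, using a toy list in place of the real download. Only the element name clinical_patient_ov and the columns clinical_stage and tumor_grade come from this thread; the second table name and all values are made up for illustration.

```r
# Toy stand-in for the named list of data frames that the code above indexes into;
# values and the second table name are hypothetical.
clinical.BCRtab.all <- list(
  clinical_patient_ov = data.frame(
    bcr_patient_barcode = c("PATIENT-A", "PATIENT-B"),
    clinical_stage = c("Stage IIIC", "Stage IV"),
    tumor_grade = c("G3", "G2"),
    stringsAsFactors = FALSE
  ),
  hypothetical_other_table = data.frame(
    bcr_patient_barcode = "PATIENT-A",
    stringsAsFactors = FALSE
  )
)

# Which tables are available?
table_names <- names(clinical.BCRtab.all)

# Which columns of the patient table mention stage or grade?
patient <- clinical.BCRtab.all$clinical_patient_ov
stage_cols <- grep("stage|grade", names(patient), value = TRUE)

# Keep the barcode together with those columns
stage_info <- patient[, c("bcr_patient_barcode", stage_cols)]
```

Searching the column names this way avoids guessing the exact field name when a table has many columns.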
I am sorry, but the problem persists whether I install TCGAbiolinks from Bioconductor or from GitHub with the command you supplied:
withr::with_envvar(c(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true"),
remotes::install_github('BioinformaticsFMRP/TCGAbiolinks')
)
Running this code:
rm(list=ls())
library(TCGAbiolinks)
library(dplyr)
library(DT)
query <- GDCquery(project = "TCGA-OV",
data.category = "Clinical",
data.type = "Clinical Supplement",
data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
reports the same error with both installations.
So I checked the function, and it also shows no such argument.
Sorry for disturbing you so many times. I hope TCGAbiolinks becomes an even more outstanding package for analysing TCGA data.