TCGAbiolinks
Data inconsistencies TCGAbiolinks vs. GDC Data Portal for NCICCR-DLBCL
Hi Tiago,
thank you for your incredibly helpful package, I used it a lot recently and it works great. (I am using Bioconductor 3.10, BiocManager 1.30.10, R 3.6.3).
For the NCICCR-DLBCL project, however, I came across behavior that seems unexpected to me and that worries me a lot. For at least some cases, "days to last follow up" does not match between the GDC Data Portal and the data I download through TCGAbiolinks.
I take DLBCL10782 as a minimal example:
[Image: DLBCL10782-survival-browser]
Now with TCGAbiolinks:
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
library(TCGAbiolinks)
library(SummarizedExperiment)
rna_query = GDCquery(project = "NCICCR-DLBCL",
                     data.category = "Transcriptome Profiling",
                     data.type = "Gene Expression Quantification",
                     workflow.type = "HTSeq - Counts")
GDCdownload(rna_query)
rna = GDCprepare(rna_query)
metadata = data.frame(colData(rna))
print(metadata[metadata$sample_submitter_id == "DLBCL10782-sample", "days_to_last_follow_up"])
[1] 4628
Why is "days to last follow up" 1301 in the GDC Data Portal and 4628 in the TCGAbiolinks-downloaded data?
What strikes me as odd: for each "days to last follow up" value I manually look up in the GDC Data Portal, I find a different sample in the TCGAbiolinks-downloaded data with the same "days to last follow up" value. Could there be a mix-up of some sort?
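One way to test the mix-up hypothesis is to align both sources by case ID and then look up, for each portal value, which ID carries that same value on the TCGAbiolinks side. The snippet below is only a rough sketch on toy data: the case IDs, the column layout, and every value other than 1301 and 4628 are made up for illustration and are not taken from GDC.

```r
# Toy data only: the IDs and all values except 1301 and 4628 are made up.
portal <- data.frame(
  case_id = c("CASE_A", "CASE_B", "CASE_C"),
  days_to_last_follow_up = c(1301, 4628, 250),
  stringsAsFactors = FALSE
)
# Hypothetical TCGAbiolinks-side table: same values, rotated across IDs.
biolinks <- data.frame(
  case_id = c("CASE_A", "CASE_B", "CASE_C"),
  days_to_last_follow_up = c(4628, 250, 1301),
  stringsAsFactors = FALSE
)

# Align the two tables by case ID.
cmp <- merge(portal, biolinks, by = "case_id",
             suffixes = c("_portal", "_biolinks"))

# Flag cases whose values disagree between the two sources.
cmp$mismatch <- cmp$days_to_last_follow_up_portal !=
  cmp$days_to_last_follow_up_biolinks

# For each portal value, find the case that carries it on the other side.
# match() returns only the first hit, so duplicated values need extra care.
cmp$value_found_under <- biolinks$case_id[
  match(cmp$days_to_last_follow_up_portal, biolinks$days_to_last_follow_up)
]

print(cmp)
```

If most mismatched rows find their portal value under a different ID, that points to an ID swap rather than to wrong values.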
Anyways, any help is much appreciated. Thank you a lot for your efforts!
This is still an issue!!!! It appears to be an ID swap as far as I can tell, since, as the OP says, there are rows with the same values but different IDs. So if you're using this clinical data to match to the omics data, it will be wrong.
This should be solved now: https://rpubs.com/tiagochst/issue_399_TCGA-NCICCR-DLBCL
Spot-checked a few of the columns, and the data from TCGAbiolinks now matches GDC. Thanks!
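A spot check like this can also be scripted. The sketch below assumes both sources have been loaded as data frames sharing an ID column; `count_mismatches` is a hypothetical helper, not part of TCGAbiolinks, and the toy tables and their column names are made up for illustration.

```r
# Hypothetical helper: count, per column, how many cases differ between tables.
count_mismatches <- function(portal, biolinks, id_col, cols) {
  # Reorder the second table so its rows line up with the first by ID.
  idx <- match(portal[[id_col]], biolinks[[id_col]])
  sapply(cols, function(col) {
    a <- portal[[col]]
    b <- biolinks[[col]][idx]
    # NA on both sides counts as agreement, NA on one side as a mismatch.
    same <- (is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b)
    sum(!same)
  })
}

# Toy tables with made-up IDs; row order differs on purpose.
portal <- data.frame(
  case_id = c("CASE_A", "CASE_B", "CASE_C"),
  days_to_last_follow_up = c(1301, 4628, NA),
  vital_status = c("Alive", "Dead", "Alive"),
  stringsAsFactors = FALSE
)
biolinks <- data.frame(
  case_id = c("CASE_C", "CASE_A", "CASE_B"),
  days_to_last_follow_up = c(NA, 1301, 4628),
  vital_status = c("Alive", "Alive", "Dead"),
  stringsAsFactors = FALSE
)

res <- count_mismatches(portal, biolinks, "case_id",
                        c("days_to_last_follow_up", "vital_status"))
print(res)
```

A count of zero for every column means the two sources agree on the checked cases; any non-zero count shows which column to inspect.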