Data inconsistencies TCGAbiolinks vs. GDC Data Portal for NCICCR-DLBCL

Open k0n5 opened this issue 4 years ago • 3 comments

Hi Tiago,

Thank you for your incredibly helpful package. I have used it a lot recently and it works great. (I am using Bioconductor 3.10, BiocManager 1.30.10, and R 3.6.3.)

For the NCICCR-DLBCL project, however, I came across behavior that seems unexpected and that worries me a lot. For at least some cases, "days to last follow up" does not match between the GDC Data Portal and the data I download through TCGAbiolinks.

I take DLBCL10782 as a minimal example:

[Screenshot: survival data for DLBCL10782 as shown in the GDC Data Portal browser]

Now with TCGAbiolinks:

# Install TCGAbiolinks from its GitHub repository, then load the packages
BiocManager::install("BioinformaticsFMRP/TCGAbiolinks")
library(TCGAbiolinks)
library(SummarizedExperiment)

# Query, download, and prepare the HTSeq count data for NCICCR-DLBCL
rna_query <- GDCquery(project = "NCICCR-DLBCL",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts")
GDCdownload(rna_query)
rna <- GDCprepare(rna_query)

# Pull the per-sample metadata and look up the follow-up time for DLBCL10782
metadata <- data.frame(colData(rna))
print(metadata[metadata$sample_submitter_id == "DLBCL10782-sample", "days_to_last_follow_up"])
#> [1] 4628

Why is "days to last follow up" 1301 in the GDC Data Portal but 4628 in the TCGAbiolinks-downloaded data?

What strikes me as odd: for each "days to last follow up" value I look up manually in the GDC Data Portal, I find a different sample in the TCGAbiolinks-downloaded data with that same value. Could there be a mix-up of some sort?
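A minimal sketch of how such a mix-up could be checked systematically, rather than one case at a time. It uses two small toy tables in place of the real sources: only the 1301 and 4628 values for DLBCL10782 come from this report, while the case IDs `CASE_A` and `CASE_B`, the value 250, and the column name `case_id` are made up for illustration. The idea is to join both tables on the case ID, flag disagreeing rows, and then look up which case carries each portal value in the downloaded data.

```r
# Toy stand-in for values read off the GDC Data Portal.
# CASE_A, CASE_B and the value 250 are hypothetical.
portal <- data.frame(
  case_id = c("DLBCL10782", "CASE_A", "CASE_B"),
  days_to_last_follow_up = c(1301, 4628, 250),
  stringsAsFactors = FALSE
)

# Toy stand-in for the TCGAbiolinks-downloaded metadata:
# the same values, but attached to different cases.
biolinks <- data.frame(
  case_id = c("DLBCL10782", "CASE_A", "CASE_B"),
  days_to_last_follow_up = c(4628, 250, 1301),
  stringsAsFactors = FALSE
)

# Join on case ID and flag rows where the two sources disagree.
cmp <- merge(portal, biolinks, by = "case_id",
             suffixes = c("_portal", "_biolinks"))
cmp$mismatch <- cmp$days_to_last_follow_up_portal !=
  cmp$days_to_last_follow_up_biolinks

# For each portal value, find the case that carries it in the download.
# match() returns only the first hit, so duplicated values need extra care.
cmp$found_under <- biolinks$case_id[
  match(cmp$days_to_last_follow_up_portal, biolinks$days_to_last_follow_up)
]

# Same set of values in both sources but mismatching rows suggests that
# values were shuffled between IDs rather than altered.
same_values <- identical(sort(portal$days_to_last_follow_up),
                         sort(biolinks$days_to_last_follow_up))

print(cmp)
print(same_values)
```

If `same_values` is `TRUE` while most rows are flagged in `mismatch`, that pattern points to values being attached to the wrong IDs. With real data, values shared by several cases would make `found_under` ambiguous, so it is only a hint.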

Any help is much appreciated. Thanks a lot for your efforts!

k0n5, Apr 20 '20 09:04

This is still an issue! As far as I can tell it is an ID swap: as the OP says, there are rows with the same values but different IDs. So if you use this clinical data to match against the omics data, the matches will be wrong.

nerdcommander, Jun 13 '22 22:06

This should be solved now: https://rpubs.com/tiagochst/issue_399_TCGA-NCICCR-DLBCL

tiagochst, Jun 14 '22 14:06

I spot-checked a few of the columns, and the data from TCGAbiolinks now matches GDC. Thanks!

nerdcommander, Jun 14 '22 17:06