Mismatched subtype information between `SummarizedExperiment` object and `TCGAquery_subtype`
Hello,
Thanks for this very helpful package. I have a question regarding accessing the subtype information associated with TCGA projects (in this example, specifically COAD, but my question applies to other projects as well, including SKCM).
When I download the RNAseq experiment as a SummarizedExperiment object I can access the metadata associated with the samples by calling colData(coad). In this data frame, there is information regarding MSI (microsatellite instability) status of tumors. The information I get from there is the following:
# Prepared coad object previously by using GDCdownload and GDCprepare functions
meta <- as.data.frame(colData(coad))
dim(meta)
#>[1] 521 102
summary(meta$subtype_MSI_status)
#>                       MSI-H         MSI-L           MSS Not Evaluable          NA's
#>             0            40            42           126             0           313
Alternatively, I can also download subtype information using TCGAquery_subtype function. When I do that and look at the MSI data in the downloaded data frame, this is what I see:
subtype <- TCGAbiolinks::TCGAquery_subtype("COAD")
dim(subtype)
#>[1] 276 45
summary(subtype$MSI_status)
#>                       MSI-H         MSI-L           MSS Not Evaluable
#>             0            38            44           193             1
A similar discrepancy is also present when comparing survival times between the SummarizedExperiment and TCGAquery_subtype data frames. One has a shorter follow-up time than the other for some patients (i.e., the patient is censored at an early date with an alive vital_status in one data frame, whereas he/she appears deceased at a later time point in the other).
What is the reason for the discrepancy between the different subtype data? I remember having similar issues with SKCM (I didn't try the others much). I would appreciate it if you could let me know which version is the more accurate one to use.
Best, Atakan
I wrote the code I used here to check some of the data: https://rpubs.com/tiagochst/TCGAbiolinks_Checking_subtype_information
The TCGAquery_subtype is accessing metadata retrieved from published TCGA papers, which might contain more samples than the RNA-Seq data. For example, there might be samples with only DNA methylation.
There are a couple of reasons why you might see more MSI-H cases. The papers normally annotate the patient (TCGA-A6-2672) rather than the sample, and a patient might have duplicated samples: "TCGA-A6-2672-01A" and "TCGA-A6-2672-01B". That is one reason for the increase, since we did not check in depth which replicates were used in each paper.
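To check how much of the difference comes from duplicated samples, one option is to collapse the sample-level table to one row per patient before tabulating. The following is only a sketch on a toy table that stands in for `as.data.frame(colData(coad))`: the `barcode` column name and the `TCGA-XX-...` barcodes are made up for illustration, and only the TCGA-A6-2672 barcodes come from the explanation above.

```r
# Toy sample-level table; in practice this would be the colData of the
# SummarizedExperiment. The "barcode" column name is an assumption.
meta <- data.frame(
  barcode = c("TCGA-A6-2672-01A", "TCGA-A6-2672-01B",
              "TCGA-XX-0001-01A", "TCGA-XX-0002-01A"),
  subtype_MSI_status = c("MSI-H", "MSI-H", "MSS", NA),
  stringsAsFactors = FALSE
)

# The first 12 characters of a TCGA barcode identify the patient.
meta$patient <- substr(meta$barcode, 1, 12)

# Patients that contribute more than one sample.
n_per_patient <- table(meta$patient)
dup_patients <- names(n_per_patient)[n_per_patient > 1]

# Keep the first row per patient so that each patient is counted once.
meta_unique <- meta[!duplicated(meta$patient), ]

# Sample-level versus patient-level counts.
sample_counts  <- table(meta$subtype_MSI_status, useNA = "ifany")
patient_counts <- table(meta_unique$subtype_MSI_status, useNA = "ifany")
```

Keeping the first row per patient is an arbitrary choice here; which replicate to keep depends on the analysis.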
For the survival, GDC might have more up-to-date data than the paper had when it was published.
But that is also something you would need to look into more deeply.
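One way to look into the survival differences is to join the two sources on the patient barcode and list the patients whose records disagree. This is a rough sketch on toy data: the column names (`patient`, `vital_status`, `days`) and all values are hypothetical, so they need to be replaced by the actual columns in each data frame.

```r
# Toy stand-in for the clinical data that comes with the GDC download.
gdc <- data.frame(
  patient      = c("TCGA-XX-0001", "TCGA-XX-0002", "TCGA-XX-0003"),
  vital_status = c("Dead", "Alive", "Alive"),
  days         = c(900, 1200, 300),
  stringsAsFactors = FALSE
)

# Toy stand-in for the table returned by TCGAquery_subtype.
paper <- data.frame(
  patient      = c("TCGA-XX-0001", "TCGA-XX-0002"),
  vital_status = c("Alive", "Alive"),
  days         = c(400, 1200),
  stringsAsFactors = FALSE
)

# Inner join: only patients present in both sources are kept.
both <- merge(gdc, paper, by = "patient", suffixes = c(".gdc", ".paper"))

# Patients whose vital status or follow-up time differs between sources.
discordant <- both[both$vital_status.gdc != both$vital_status.paper |
                   both$days.gdc != both$days.paper, ]
```

With real data, missing values in either column would need to be handled before the comparison, since comparing against `NA` yields `NA` rather than `TRUE` or `FALSE`.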
By the way, one thing I need to check more carefully later: in the SummarizedExperiment we also added the paper information to the normal samples, which is tricky, since I believe the information belongs to the primary tumor samples.
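Until that is checked, a possible workaround is to keep only the primary tumor samples before using the subtype columns. The sketch below relies on the TCGA barcode convention that characters 14 and 15 encode the sample type ("01" for primary solid tumor, "11" for solid tissue normal). The "-11A" and `TCGA-XX-...` barcodes are made up for illustration.

```r
# Toy barcodes: one tumor and one normal sample from the same patient,
# plus a tumor sample from another (hypothetical) patient.
barcodes <- c("TCGA-A6-2672-01A", "TCGA-A6-2672-11A", "TCGA-XX-0001-01A")

# Characters 14-15 of a TCGA barcode hold the sample type code.
sample_type <- substr(barcodes, 14, 15)

# "01" marks primary solid tumor samples.
is_primary_tumor <- sample_type == "01"
tumor_barcodes <- barcodes[is_primary_tumor]
```

The same logical vector could be used to subset the rows of the metadata data frame, assuming its rows are in the same order as the barcodes.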
Thanks for the insights! I will take a closer look as soon as I can. The reasons for the differences make sense. I will compare both ways of getting the metadata and pick the most current/comprehensive version for analyses in the future. Do you know if there is an effort at the GDC or TCGAbiolinks level to standardize data downloads using the most current data?