Cell line metadata incorrectly processed
It appears that the cell line metadata for GDSC1000 (and likely other datasets, assuming they were processed in a similar manner) is corrupted for some cell lines.
Specifically, cell lines which contain invalid R row name characters appear to be improperly handled.
For example, take the cell line NB(TU)1-10:
In the PharmacoGx cell line metadata, it gets converted to "NB-TU-1-10", and many of the metadata fields end up as NA (likely due to mismatched IDs):
To reproduce:
```r
library(PharmacoGx)
library(tidyverse)

GDSC <- downloadPSet("GDSC1000")

GDSC@cell[18, ] %>%
  glimpse()
# Observations: 1
# Variables: 15
# $ Sample.Name <chr> NA
# $ COSMIC.identifier <int> NA
# $ Whole.Exome.Sequencing..WES. <chr> NA
# $ Copy.Number.Alterations..CNA. <chr> NA
# $ Gene.Expression <chr> NA
# $ Methylation <chr> NA
# $ Drug.Response <chr> NA
# $ GDSC.Tissue.descriptor.1 <chr> NA
# $ GDSC.Tissue.descriptor.2 <chr> NA
# $ Cancer.Type..matching.TCGA.label. <chr> NA
# $ Microsatellite..instability.Status..MSI. <chr> NA
# $ Screen.Medium <chr> NA
# $ Growth.Properties <chr> NA
# $ cellid <chr> "NB-TU-1-10"
# $ tissueid <chr> "kidney"

# est. total number of cell lines affected
sum(is.na(GDSC@cell$Sample.Name))
# 109
```
Based on the presence of missing values, this appears to affect ~109 of the cell lines in GDSC1000, and is likely also related to issue #40.
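As a rough way to see which names are at risk, the sketch below flags names that base R's `make.names()` would alter. The name list is a hypothetical sample (only "NB(TU)1-10" comes from this issue), and `make.names()` is only a stand-in for whatever sanitizing step the processing actually used: it produces dots ("NB.TU.1.10") rather than the hyphens seen in "NB-TU-1-10", so the real pipeline may differ.

```r
# Hypothetical sample of cell line names; only "NB(TU)1-10" is taken from this issue.
cell_names <- c("NB(TU)1-10", "A549", "22RV1", "HCC-1937")

# make.names() replaces invalid characters with "." and prepends "X"
# to names that do not start with a letter (or a dot not followed by a digit).
sanitized <- make.names(cell_names)

# Names that do not survive sanitizing unchanged, and so can no longer
# be matched against the original metadata by exact string comparison.
changed <- cell_names[sanitized != cell_names]

print(data.frame(original = cell_names, sanitized = sanitized))
print(changed)
```

Running the same comparison on the raw names from the source metadata file (rather than on the already-processed `cellid` column) should give a list of candidates to check against the rows with NA fields.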
System info:
- R 3.6.2
- PharmacoGx_1.17.1
@khughitt we are currently working on finalizing a reannotation of all the cell lines, based on mapping to a standardized name from the wonderful resource Cellosaurus: https://web.expasy.org/cellosaurus/
This should clear up any of the mismapping issues. However, specifically for the GDSC1000 dataset, some of the cell lines without sample names genuinely had no sample name annotated. Some cell lines were profiled only on the microarray and were not included in the cell line metadata files released with the study.
@p-smirnov Thanks for the quick follow-up and clarifications!
It looks like if you use the Cellosaurus accession for rownames, they will be safe from getting butchered by make.names().
That is surprising about the GDSC1000 cell lines, although perhaps it shouldn't be all that surprising given my luck working with metadata for public datasets, heh.
Have you considered reaching out to someone involved with GDSC to see if they might have that information saved somewhere? That seems like it could be generally useful to share with the community if they do.
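A minimal sketch of that idea, assuming accessions are made up only of letters, digits, and underscores. The table layout and column names are hypothetical, and "CVCL_XXXX" / "CVCL_YYYY" are placeholders rather than real accessions:

```r
# Hypothetical annotation table keyed by accession-style IDs.
# "CVCL_XXXX" and "CVCL_YYYY" are placeholders, not real Cellosaurus accessions.
annot <- data.frame(
  accession = c("CVCL_XXXX", "CVCL_YYYY"),
  cell_name = c("NB(TU)1-10", "A549"),
  stringsAsFactors = FALSE
)

# IDs of this shape are already syntactically valid R names,
# so make.names() leaves them untouched.
stopifnot(identical(make.names(annot$accession), annot$accession))

# Use the accession as the row name and keep the original,
# unsanitized display name as an ordinary column.
rownames(annot) <- annot$accession
print(annot["CVCL_XXXX", "cell_name"])
```

Keeping the original name as a column means it never has to pass through any row-name sanitizing at all.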
And thanks for sharing the cell line resource! That looks very useful.