Create system to validate node IDs, labels, and names
We currently have no validation system to confirm that the node IDs in each record carry the correct names and labels. While our curators are very detail-oriented and careful, the lack of automated checks leaves open the risk that data errors are introduced. We should create a simple script that queries each ID against one or more authoritative resources (e.g., mygene.info/mychem.info/mydisease.info, OLS, UMLS, NodeNormalizer). More details below.
The first record in the indication_paths.yaml file is here:
- directed: true
  graph:
    disease: CML (ph+)
    disease_mesh: MESH:D015464
    drug: imatinib
    drug_mesh: MESH:D000068877
    drugbank: DB:DB00619
  links:
  - key: decreases activity of
    source: MESH:D000068877
    target: UniProt:P00519
  - key: causes
    source: UniProt:P00519
    target: MESH:D015464
  multigraph: true
  nodes:
  - id: MESH:D000068877
    label: Drug
    name: imatinib
  - id: UniProt:P00519
    label: Protein
    name: BCR/ABL
  - id: MESH:D015464
    label: Disease
    name: CML (ph+)
There are three IDs under nodes: MESH:D000068877, UniProt:P00519, and MESH:D015464. If we look up the first ID in the MeSH API (https://id.nlm.nih.gov/mesh/lookup/details?descriptor=D000068877), we see that the preferred name for MESH:D000068877 is actually Imatinib Mesylate. Likewise, the preferred name for MESH:D015464 is Leukemia, Myelogenous, Chronic, BCR-ABL Positive. The script described here would output a version of the input YAML with each node name replaced by the "preferred name" from the MeSH API.
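As a starting point, the name replacement described above might look roughly like the sketch below. It works on a record that has already been parsed into Python dicts and lists (for example by a YAML library) and only touches nodes whose ID starts with MESH:. The function names are hypothetical, and the assumed response shape of the lookup endpoint (a "terms" list with "label" and "preferred" keys) should be checked against a live response before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the example lookup URL in this issue.
MESH_DETAILS_URL = "https://id.nlm.nih.gov/mesh/lookup/details"


def fetch_mesh_preferred_name(curie):
    """Return the preferred MeSH term for a CURIE such as 'MESH:D015464', or None.

    ASSUMPTION: the endpoint returns JSON with a "terms" list whose entries
    have "label" and "preferred" keys. Verify this before use.
    """
    descriptor = curie.split(":", 1)[1]
    query = urllib.parse.urlencode({"descriptor": descriptor})
    with urllib.request.urlopen(MESH_DETAILS_URL + "?" + query, timeout=30) as resp:
        details = json.load(resp)
    for term in details.get("terms", []):
        if term.get("preferred"):
            return term.get("label")
    return None


def replace_mesh_names(record, lookup=fetch_mesh_preferred_name):
    """Overwrite MeSH node names in place with the preferred name.

    `lookup` maps a CURIE to a preferred name (or None if unknown).
    Returns a list of (node_id, old_name, new_name) for every change made.
    """
    changes = []
    for node in record.get("nodes", []):
        node_id = str(node.get("id", ""))
        if not node_id.startswith("MESH:"):
            continue  # other identifier types are out of scope for now
        preferred = lookup(node_id)
        if preferred and preferred != node.get("name"):
            changes.append((node_id, node.get("name"), preferred))
            node["name"] = preferred
    return changes
```

Passing the lookup in as a parameter keeps the replacement logic testable without network access, and it would let the same loop be reused later with a different resolver for other identifier types. Returning the list of changes also gives curators something to review before the rewritten YAML is committed.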
The most common identifiers used in DMDB are shown here (with counts):
$ cat indication_paths.yaml.4 | grep ' id:' | sed 's/.*id: //;s/:.*//' | sort | uniq -c | sort -k1nr
8695 MESH
6227 GO
4359 UniProt
1014 HP
741 NCBITaxon
601 InterPro
429 CHEBI
428 UBERON
339 REACT
218 DB
198 CL
48 Pfam
18 PR
7 "REACT
3 TIGR
1 "InterPro
So let's start with MeSH, since it is the most common identifier type. After that, we'll expand to the other identifier types.
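The prefix counts above could also be produced inside the validation script itself. The sketch below is one possible Python equivalent of the shell pipeline; the function name is hypothetical. It additionally strips surrounding quotes from the ID value, on the assumption that the `"REACT` and `"InterPro` rows come from quoted IDs in the YAML and belong with their unquoted prefixes.

```python
from collections import Counter


def count_id_prefixes(lines):
    """Count the prefix (text before the first ':') of every 'id:' line.

    ASSUMPTION: a leading quote on an ID is YAML quoting rather than part of
    the identifier, so quotes are stripped before the prefix is extracted.
    """
    counts = Counter()
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("- "):
            stripped = stripped[2:].lstrip()  # drop the YAML list marker
        if not stripped.startswith("id:"):
            continue
        value = stripped[len("id:"):].strip().strip("\"'")
        prefix, sep, _ = value.partition(":")
        if sep:  # skip values that are not prefix:local_id
            counts[prefix] += 1
    return counts
```

For example, `count_id_prefixes(open("indication_paths.yaml.4"))` followed by `.most_common()` would list the prefixes from most to least frequent.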