Create system to validate node IDs, labels, and names

Open andrewsu opened this issue 3 years ago • 0 comments

We currently have no validation system to confirm that the node IDs in each record carry the correct names and labels. Our curators are detail-oriented and careful, but without automated checks there is a risk that data errors get introduced. We should create a simple script that checks each ID against one or more authoritative resources (e.g., mygene.info/mychem.info/mydisease.info, OLS, UMLS, NodeNormalizer). More details below.

The first record in the indication_paths.yaml file is here:

-   directed: true
    graph:
        disease: CML (ph+)
        disease_mesh: MESH:D015464
        drug: imatinib
        drug_mesh: MESH:D000068877
        drugbank: DB:DB00619
    links:
    -   key: decreases activity of
        source: MESH:D000068877
        target: UniProt:P00519
    -   key: causes
        source: UniProt:P00519
        target: MESH:D015464
    multigraph: true
    nodes:
    -   id: MESH:D000068877
        label: Drug
        name: imatinib
    -   id: UniProt:P00519
        label: Protein
        name: BCR/ABL
    -   id: MESH:D015464
        label: Disease
        name: CML (ph+)

There are three IDs under nodes: MESH:D000068877, UniProt:P00519, and MESH:D015464. Looking up the first ID in the MeSH API (https://id.nlm.nih.gov/mesh/lookup/details?descriptor=D000068877) shows that the preferred name for MESH:D000068877 is actually Imatinib Mesylate, not imatinib. Likewise, the preferred name for MESH:D015464 is Leukemia, Myelogenous, Chronic, BCR-ABL Positive, not CML (ph+). The script described here would output a version of the input YAML with each name replaced by the "preferred name" from the MeSH API.
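A minimal sketch of that replacement step, operating on one already-parsed record (a dict shaped like the YAML above). The function names are hypothetical, and the response shape is an assumption: the sketch assumes the lookup endpoint returns JSON with a "terms" list whose entries carry "label" and "preferred" fields, which should be checked against a real response before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the issue text; the response shape below is an assumption.
MESH_DETAILS_URL = "https://id.nlm.nih.gov/mesh/lookup/details"


def fetch_mesh_preferred_name(descriptor_id):
    """Return the preferred term label for a MeSH descriptor, or None.

    Assumes the JSON response has a "terms" list whose entries have
    "label" and "preferred" keys (not confirmed by the issue).
    """
    query = urllib.parse.urlencode({"descriptor": descriptor_id})
    with urllib.request.urlopen(MESH_DETAILS_URL + "?" + query, timeout=30) as resp:
        details = json.load(resp)
    for term in details.get("terms", []):
        if term.get("preferred"):
            return term.get("label")
    return None


def replace_mesh_names(record, lookup=fetch_mesh_preferred_name):
    """Return (updated_record, changes) with MeSH node names replaced.

    `lookup` maps a bare descriptor ID (e.g. "D000068877") to its preferred
    name, so it can be swapped for a cache or a stub in tests.
    Nodes with other prefixes are copied through unchanged.
    """
    changes = []
    new_nodes = []
    for node in record.get("nodes", []):
        node = dict(node)  # copy so the input record is not mutated
        prefix, _, local_id = node["id"].partition(":")
        if prefix == "MESH":
            preferred = lookup(local_id)
            if preferred and preferred != node.get("name"):
                changes.append((node["id"], node.get("name"), preferred))
                node["name"] = preferred
        new_nodes.append(node)
    updated = dict(record)
    updated["nodes"] = new_nodes
    return updated, changes
```

Assuming PyYAML is available, the records could be loaded with yaml.safe_load, passed through replace_mesh_names one at a time, and written back with yaml.safe_dump. Returning the list of changes alongside the updated record lets curators review each rename instead of accepting it blindly.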

The most common identifiers used in DMDB are shown here (with counts):

$ cat indication_paths.yaml.4 | grep ' id:' | sed 's/.*id: //;s/:.*//' | sort | uniq -c | sort -k1nr
   8695 MESH
   6227 GO
   4359 UniProt
   1014 HP
    741 NCBITaxon
    601 InterPro
    429 CHEBI
    428 UBERON
    339 REACT
    218 DB
    198 CL
     48 Pfam
     18 PR
      7 "REACT
      3 TIGR
      1 "InterPro

So let's start with MeSH, since it is the most common identifier type. After that, we'll expand to the other identifier types.
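To track that rollout, the prefix counts above could also be computed from parsed records and compared against the set of prefixes that already have a validator. This is only a sketch: both function names and the `supported` set are hypothetical.

```python
from collections import Counter


def count_id_prefixes(records):
    """Count node ID prefixes (the part before the first colon)."""
    counts = Counter()
    for record in records:
        for node in record.get("nodes", []):
            prefix = str(node["id"]).partition(":")[0]
            counts[prefix] += 1
    return counts


def unsupported_prefixes(counts, supported):
    """List (prefix, count) pairs with no validator yet, most common first."""
    return [(p, n) for p, n in counts.most_common() if p not in supported]
```

Calling unsupported_prefixes(counts, {"MESH"}) would give the remaining identifier types in priority order. Because this works on parsed YAML rather than raw text, IDs that are quoted in the file (the source of the `"REACT` and `"InterPro` rows above) would be counted together with their unquoted peers.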

Feb 11 '22 23:02 andrewsu