
Why do some UMLS nodes not have TUIs listed in the `description` field?

Open saramsey opened this issue 4 years ago • 5 comments

Thank you to Will Byrd for reporting this issue.

For many UMLS nodes in KG2, we include the semantic type (TUI) in the description field. But for some, we do not. For example, the Cypher query

match (n {id: 'UMLS:C0018681'}) return n.name, n.description

shows that for "headache", the description field includes the TUI, as expected. But for the Cypher query

match (n {id: 'UMLS:C0394007'}) return n.name, n.description

the result for "Cerebral Palsy" does not include the TUI in the description field. Why is that? (The subtext here is that Team Unsecret Agent sometimes uses the TUI information for KG2 UMLS nodes, so providing it would be helpful to them.)


saramsey avatar May 26 '21 15:05 saramsey

See the full text of Will Byrd's email in #56.

saramsey avatar May 26 '21 16:05 saramsey

From KG2.6.7:

match (n) where (split(n.provided_by, ':')[0]='umls_source' or n.provided_by="identifiers_org_registry:umls") and not (n.description contains "UMLS_STY") return count(n)

count(n)
186647

match (n) where (split(n.provided_by, ':')[0]='umls_source' or n.provided_by="identifiers_org_registry:umls") and not (n.description contains "UMLS_STY") return n.category, n.provided_by, count(n)

n.category n.provided_by count(n)
"biolink:MolecularEntity" "identifiers_org_registry:umls" 20954
"biolink:ChemicalSubstance" "identifiers_org_registry:umls" 6376
"biolink:Drug" "identifiers_org_registry:umls" 2178
"biolink:InformationContentEntity" "umls_source:ATC" 3
"biolink:NamedThing" "identifiers_org_registry:umls" 1483
"biolink:IndividualOrganism" "identifiers_org_registry:umls" 6883
"biolink:AnatomicalEntity" "identifiers_org_registry:umls" 926
"biolink:InformationContentEntity" "umls_source:DRUGBANK" 2
"biolink:Protein" "identifiers_org_registry:umls" 896
"biolink:GrossAnatomicalStructure" "identifiers_org_registry:umls" 2214
"biolink:CellularComponent" "identifiers_org_registry:umls" 6060
"biolink:InformationContentEntity" "umls_source:FMA" 164
"biolink:Cell" "identifiers_org_registry:umls" 1059
"biolink:InformationContentEntity" "identifiers_org_registry:umls" 24006
"biolink:PhysiologicalProcess" "identifiers_org_registry:umls" 32797
"biolink:Disease" "identifiers_org_registry:umls" 19689
"biolink:GenomicEntity" "identifiers_org_registry:umls" 769
"biolink:DiseaseOrPhenotypicFeature" "identifiers_org_registry:umls" 17469
"biolink:MolecularActivity" "identifiers_org_registry:umls" 26543
"biolink:Phenomenon" "identifiers_org_registry:umls" 2946
"biolink:Activity" "identifiers_org_registry:umls" 1513
"biolink:PathologicalProcess" "identifiers_org_registry:umls" 1365
"biolink:InformationContentEntity" "umls_source:GO" 26
"biolink:Procedure" "identifiers_org_registry:umls" 7553
"biolink:Device" "identifiers_org_registry:umls" 697
"biolink:InformationContentEntity" "umls_source:HCPCS" 19
"biolink:InformationContentEntity" "umls_source:HGNC" 26
"biolink:InformationContentEntity" "umls_source:HL7" 59
"biolink:InformationContentEntity" "umls_source:HPO" 6
"biolink:PopulationOfIndividualOrganisms" "identifiers_org_registry:umls" 244
"biolink:InformationContentEntity" "umls_source:ICD10PCS" 3
"biolink:InformationContentEntity" "umls_source:ICD9CM" 7
"biolink:GeographicLocation" "identifiers_org_registry:umls" 341
"biolink:InformationContentEntity" "umls_source:LNC" 158
"biolink:Agent" "identifiers_org_registry:umls" 626
"biolink:InformationContentEntity" "umls_source:MEDLINEPLUS" 10
"biolink:InformationContentEntity" "umls_source:MED-RT" 6
"biolink:BiologicalEntity" "identifiers_org_registry:umls" 123
"biolink:InformationContentEntity" "umls_source:MSH" 37
"biolink:Carbohydrate" "identifiers_org_registry:umls" 1
"biolink:InformationContentEntity" "umls_source:NCBITAXON" 2
"biolink:InformationContentEntity" "umls_source:NCI" 271
"biolink:InformationContentEntity" "umls_source:NDDF" 5
"biolink:NamedThing" "umls_source:OMIM" 14
"biolink:InformationContentEntity" "umls_source:PDQ" 16
"biolink:InformationContentEntity" "umls_source:PSY" 7
"biolink:InformationContentEntity" "umls_source:RXNORM" 50
"biolink:InformationContentEntity" "umls_source:VANDF" 19
"biolink:InformationContentEntity" "umls_source:MTH" 26

match (n) where (n.provided_by="identifiers_org_registry:umls") return not (n.description contains "UMLS_STY"), split(n.id, ':')[0], count(n)

not (n.description contains "UMLS_STY") split(n.id, ':')[0] count(n)
null "UMLS" 2785105
false "UMLS" 157507
true "UMLS" 185711
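
A note on the `null` row: in Cypher, `CONTAINS` applied to a null string evaluates to null, and `NOT null` is still null, so nodes with no description fall into a third bucket rather than being counted as "lacking the tag". A `WHERE` clause only keeps rows where the predicate is true, so those nodes also drop out of filtered counts. The Python sketch below mimics that three-valued behavior; the helper name `lacks_sty_tag` and the sample descriptions are hypothetical placeholders, not real KG2 data or the real tag format.

```python
from collections import Counter
from typing import Optional


def lacks_sty_tag(description: Optional[str], tag: str = "UMLS_STY") -> Optional[bool]:
    """Mimic Cypher's `not (n.description contains tag)` with null propagation."""
    if description is None:
        # null CONTAINS ... is null, and NOT null is still null
        return None
    return tag not in description


# Hypothetical sample rows (made up for illustration, not KG2 descriptions).
sample_descriptions = [
    "free text UMLS_STY:TXXX",  # tag present
    "free text with no tag",    # tag absent
    None,                       # node has no description at all
]

# Group the rows the same way the `return not (...), count(n)` query does.
tally = Counter(lacks_sty_tag(d) for d in sample_descriptions)

# A WHERE clause keeps only rows whose predicate is true, so None rows drop out.
where_matches = [d for d in sample_descriptions if lacks_sty_tag(d) is True]

print(tally)
print(where_matches)
```

If that reading applies here, the filtered counts above cover only nodes that have a description without the tag; nodes with no description at all would need a separate check such as `n.description is null`.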

ecwood avatar Jul 01 '21 17:07 ecwood

This is important for implementing #86.

ecwood avatar Jul 01 '21 17:07 ecwood

This is definitely still an issue, as of KG2.8.3:

match (n) where (n.provided_by in ["['infores:atc-codes-umls']", "['infores:cpt-codes-umls']", "['infores:drugbank']", "['infores:fma-umls']", "['infores:go']", "['infores:hcp-codes-umls']", "['infores:hcpcs-cpt-umls']", "['infores:hgnc']", "['infores:hl7-umls']", "['infores:hpo']", "['infores:icd10-umls']", "['infores:icd10ae-umls']", "['infores:icd10cm-umls']", "['infores:icd10pcs-umls']", "['infores:icd9cm-umls']", "['infores:loinc-umls']", "['infores:medrt-umls']", "['infores:meddra-umls']", "['infores:medlineplus']", "['infores:mesh']", "['infores:umls-metathesaurus']", "['infores:ncbi-taxonomy']", "['infores:ncit']", "['infores:nddf-umls']", "['infores:ndfrt']", "['infores:omim']", "['infores:pdq-umls']", "['infores:psy-umls']", "['infores:rxnorm']", "['infores:snomedct']", "['infores:vandf-umls']", "['infores:umls']"]) and not (n.description contains "STY") return n.provided_by, count(n) order by count(n) desc
n.provided_by count(n)
"['infores:umls']" 198443
"['infores:hpo']" 1720
"['infores:drugbank']" 879
"['infores:loinc-umls']" 139
"['infores:mesh']" 36
"['infores:atc-codes-umls']" 1
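
Note that in these results the `provided_by` values appear as stringified lists such as "['infores:umls']", which is why the query compares against those literal strings. If such values are post-processed outside Neo4j, a sketch like the following could recover the individual CURIEs. It assumes the value is a Python-style list literal, and `parse_provided_by` is a hypothetical helper, not part of the KG2 code base.

```python
import ast
from typing import List


def parse_provided_by(raw: str) -> List[str]:
    """Turn a stringified list such as "['infores:umls']" into a list of CURIEs."""
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal (e.g. a bare CURIE): treat it as a single source.
        return [raw]
    if isinstance(value, str):
        return [value]
    return [str(item) for item in value]


print(parse_provided_by("['infores:umls']"))
```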

ecwood avatar Jul 11 '23 18:07 ecwood

To get some sample CURIEs, I ran:

match (n) where (n.provided_by="['infores:hpo']") and not (n.description contains "STY") return n.id, n.name, n.description limit 10

I picked infores:hpo because it's easy to identify the TTL file for this source and most of the affected nodes aren't biolink:InformationContentEntity nodes.

Here are the results:

n.id n.name n.description
"MAXO:0000555" "interleukin-1 alpha biomarker measurement" "Detection of interleukin-1 alpha, a mediator of the inflammatory response."
"MAXO:0000558" "interleukin-12 biomarker measurement" "Detection of interleukin-12 levels, an inflammatory cytokine."
"MAXO:0000559" "tumor necrosis factor-alpha biomarker measurement" "Detection of TNF-alpha levels, a cytokine involved in systemic inflammation."
"MPATH:515" "non-Lymphoid neoplasias" "Hematological neoplasias of non-lymphoid origin."
"MAXO:0000520" "obstetric ultrasonography" "Use of medical ultrasonography in pregnancy where sound waves are used to create a real-time visual image of the developing fetus in the uterus. Imaging can include the mother's ovaries and uterus as well."
"MAXO:0000529" "prenatal genetic testing" "Testing of fetal DNA during pregnancy to determine if the fetus has chromosomal aberrations, fetal aneuploidy, or other detectable genetic disorders."
"MPATH:502" "monocytic leukaemia" "Leukaemia in which neoplastic cells are poorly or moderately differentiated with a monocytic but no neutrophilic component. At least 20% of the cells must be blasts."
"MAXO:0000527" "physical examination" "A systemic evaluation of the body and its functions using visual inspection, palpation, percussion and auscultation. The purpose is to determine the presence or absence of physical signs of disease or abnormality for an individual's health assessment."
"MAXO:0000528" "prenatal examination" "A test or diagnostic examination to assess the health status of the mother and well being of the fetus."
"MAXO:0000526" "clinical examination" "A direct assessment of a patient's condition by a clinical health professional that is based on a physical exam, medical history, and the patient's account of symptoms."

Here's how I found out the category information:

match (n) where (n.provided_by in ["['infores:atc-codes-umls']", "['infores:cpt-codes-umls']", "['infores:drugbank']", "['infores:fma-umls']", "['infores:go']", "['infores:hcp-codes-umls']", "['infores:hcpcs-cpt-umls']", "['infores:hgnc']", "['infores:hl7-umls']", "['infores:hpo']", "['infores:icd10-umls']", "['infores:icd10ae-umls']", "['infores:icd10cm-umls']", "['infores:icd10pcs-umls']", "['infores:icd9cm-umls']", "['infores:loinc-umls']", "['infores:medrt-umls']", "['infores:meddra-umls']", "['infores:medlineplus']", "['infores:mesh']", "['infores:umls-metathesaurus']", "['infores:ncbi-taxonomy']", "['infores:ncit']", "['infores:nddf-umls']", "['infores:ndfrt']", "['infores:omim']", "['infores:pdq-umls']", "['infores:psy-umls']", "['infores:rxnorm']", "['infores:snomedct']", "['infores:vandf-umls']", "['infores:umls']"]) and not (n.description contains "STY") return n.category, n.provided_by, count(n) order by count(n) desc
n.category n.provided_by count(n)
"biolink:PhysiologicalProcess" "['infores:umls']" 31261
"biolink:MolecularActivity" "['infores:umls']" 26624
"biolink:Disease" "['infores:umls']" 22273
"biolink:DiseaseOrPhenotypicFeature" "['infores:umls']" 19000
"biolink:ChemicalEntity" "['infores:umls']" 18970
"biolink:Publication" "['infores:umls']" 18074
"biolink:InformationContentEntity" "['infores:umls']" 9282
"biolink:NamedThing" "['infores:umls']" 8521
"biolink:Procedure" "['infores:umls']" 6946
"biolink:CellularComponent" "['infores:umls']" 6101
"biolink:OrganismTaxon" "['infores:umls']" 5420
"biolink:Phenomenon" "['infores:umls']" 3429
"biolink:Activity" "['infores:umls']" 2684
"biolink:Polypeptide" "['infores:umls']" 2326
"biolink:Drug" "['infores:umls']" 2191
"biolink:GrossAnatomicalStructure" "['infores:umls']" 2187
"biolink:Device" "['infores:umls']" 1672
"biolink:PathologicalProcess" "['infores:umls']" 1622
"biolink:BiologicalEntity" "['infores:umls']" 1471
"biolink:Cell" "['infores:umls']" 1313
"biolink:AnatomicalEntity" "['infores:umls']" 1064
"biolink:Behavior" "['infores:umls']" 1024
"biolink:PhysicalEntity" "['infores:umls']" 995
"biolink:PhenotypicFeature" "['infores:hpo']" 896
"biolink:Cohort" "['infores:umls']" 805
"biolink:SmallMolecule" "['infores:drugbank']" 773
"biolink:Agent" "['infores:umls']" 687
"biolink:NamedThing" "['infores:hpo']" 581
"biolink:PhenotypicFeature" "['infores:umls']" 569
"biolink:NucleicAcidEntity" "['infores:umls']" 503
"biolink:IndividualOrganism" "['infores:umls']" 393
"biolink:GeographicLocation" "['infores:umls']" 356
"biolink:PopulationOfIndividualOrganisms" "['infores:umls']" 258
"biolink:Food" "['infores:umls']" 218
"biolink:SmallMolecule" "['infores:umls']" 147
"biolink:InformationContentEntity" "['infores:loinc-umls']" 139
"biolink:ChemicalEntity" "['infores:drugbank']" 106
"biolink:BiologicalEntity" "['infores:hpo']" 69
"biolink:Protein" "['infores:hpo']" 66
"biolink:Event" "['infores:umls']" 54
"biolink:BehavioralFeature" "['infores:hpo']" 46
"biolink:InformationContentEntity" "['infores:mesh']" 36
"biolink:Activity" "['infores:hpo']" 32
"biolink:InformationContentEntity" "['infores:hpo']" 26
"biolink:Protein" "['infores:umls']" 3
"biolink:BiologicalProcess" "['infores:hpo']" 3
"biolink:InformationContentEntity" "['infores:atc-codes-umls']" 1
"biolink:InformationResource" "['infores:hpo']" 1
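
As a consistency check, the per-category rows for the sources other than infores:umls can be summed and compared against the per-source totals from the earlier KG2.8.3 query. A small Python sketch, with the rows copied from the two tables in this comment thread (the variable names are illustrative only):

```python
from collections import defaultdict

# (category, source, count) rows copied from the category breakdown,
# restricted to sources other than infores:umls.
category_rows = [
    ("biolink:PhenotypicFeature", "infores:hpo", 896),
    ("biolink:SmallMolecule", "infores:drugbank", 773),
    ("biolink:NamedThing", "infores:hpo", 581),
    ("biolink:InformationContentEntity", "infores:loinc-umls", 139),
    ("biolink:ChemicalEntity", "infores:drugbank", 106),
    ("biolink:BiologicalEntity", "infores:hpo", 69),
    ("biolink:Protein", "infores:hpo", 66),
    ("biolink:BehavioralFeature", "infores:hpo", 46),
    ("biolink:InformationContentEntity", "infores:mesh", 36),
    ("biolink:Activity", "infores:hpo", 32),
    ("biolink:InformationContentEntity", "infores:hpo", 26),
    ("biolink:BiologicalProcess", "infores:hpo", 3),
    ("biolink:InformationContentEntity", "infores:atc-codes-umls", 1),
    ("biolink:InformationResource", "infores:hpo", 1),
]

# Per-source totals copied from the earlier per-source query result.
source_totals = {
    "infores:hpo": 1720,
    "infores:drugbank": 879,
    "infores:loinc-umls": 139,
    "infores:mesh": 36,
    "infores:atc-codes-umls": 1,
}

# Roll the category rows up by source.
rolled_up = defaultdict(int)
for _category, source, count in category_rows:
    rolled_up[source] += count

for source, expected in source_totals.items():
    status = "ok" if rolled_up[source] == expected else "MISMATCH"
    print(f"{source}: {rolled_up[source]} vs {expected} ({status})")
```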

ecwood avatar Jul 11 '23 18:07 ecwood