Node with the name `related`
There appears to be a node `CHV:0000018040` with the category `biolink:ChemicalEntity` and the name `related`. This seems to be a bug (and it throws off a tiny portion of our LLM NLP->TRAPI work in https://github.com/Translator-CATRAX/catrax-milestones/issues/36). It is easy enough for us to fix locally, but I thought you would want to know and investigate whether it's part of a larger problem.
Hi @dkoslicki, thank you for the bug report. Where did this appear? In an ARAX result?
In the KG2 JSON-lines dump, while building a RAG database to help with NLP->TRAPI. So not in ARAX, but rather in the KG2 dump itself.
"Bad node" is a "Bad issue title". Updated
Here is the node in KG2pre:
{"category": "biolink:ChemicalEntity", "category_label": "chemical_entity", "creation_date": null, "deprecated": false, "description": "UMLS Semantic Type: STY:T109", "full_name": "related", "has_biological_sequence": null, "id": "CHV:0000018040", "iri": "http://purl.bioontology.org/ontology/CHV/0000018040", "name": "related", "provided_by": ["infores:chv-umls"], "publications": [], "replaced_by": null, "synonym": ["relate", "relates"], "update_date": "2023"}
It comes from the UMLS import.
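For anyone who wants to reproduce the lookup, here is a minimal sketch of scanning a JSON-lines node dump by `id` or `name`. The function name `find_nodes` and the file path in the usage note are hypothetical; the only assumption about the dump is that each non-blank line is one JSON node object with `id` and `name` fields, as in the record above.

```python
import json
from typing import Iterable, Iterator, Optional


def find_nodes(lines: Iterable[str],
               node_id: Optional[str] = None,
               name: Optional[str] = None) -> Iterator[dict]:
    """Yield nodes from JSON-lines text that match the given id and/or name.

    Hypothetical helper: with no filters it yields every node.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        node = json.loads(line)
        if node_id is not None and node.get("id") != node_id:
            continue
        if name is not None and node.get("name") != name:
            continue
        yield node
```

Usage would look something like `with open("kg2-nodes.jsonl") as f: hits = list(find_nodes(f, name="related"))`, where the path is a placeholder for wherever the dump lives. Streaming line by line avoids loading the whole dump into memory.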
There's only one edge featuring this node and it's a cross reference:
{"agent_type": "manual_agent", "domain_range_exclusion": false, "id": "CHV:0000018040---UMLS:xref---None---None---None---UMLS:C0163712---umls_source:CHV", "knowledge_level": "knowledge_assertion", "negated": false, "object": "UMLS:C0163712", "predicate": "biolink:close_match", "predicate_label": "xref", "primary_knowledge_source": "infores:chv-umls", "publications": [], "publications_info": {}, "qualified_object_aspect": null, "qualified_object_direction": null, "qualified_predicate": null, "relation_label": "xref", "source_predicate": "UMLS:xref", "subject": "CHV:0000018040", "update_date": "2023"}
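The "only one edge" check can be sketched the same way. This assumes the edges dump is also JSON-lines with `subject` and `object` fields, as in the edge above; `edges_for_node` is a hypothetical helper name, not something from the KG2 codebase.

```python
import json
from typing import Iterable, List


def edges_for_node(lines: Iterable[str], node_id: str) -> List[dict]:
    """Return every edge whose subject or object equals node_id."""
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        edge = json.loads(line)
        if node_id in (edge.get("subject"), edge.get("object")):
            matches.append(edge)
    return matches
```

A node whose only edge is a `biolink:close_match` cross-reference, like this one, would show up as a single-element list.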
The entry for the cross referenced node is:
{"category": "biolink:ChemicalEntity", "category_label": "chemical_entity", "creation_date": null, "deprecated": false, "description": "UMLS Semantic Type: STY:T109", "full_name": "Relate - vinyl resin", "has_biological_sequence": null, "id": "UMLS:C0163712", "iri": "https://identifiers.org/umls:C0163712", "name": "Relate - vinyl resin", "provided_by": ["infores:umls"], "publications": [], "replaced_by": null, "synonym": ["related", "relate", "relates"], "update_date": "2023"}
Here is the "raw" (from umls.jsonl) data regarding this node:
{"('CHV', '0000018040')": {"attributes": {"COMBO_SCORE": ["0.880223872", "0.870335809", "0.8118"], "COMBO_SCORE_NO_TOP_WORDS": ["0.880223872", "0.870335809", "0.8118"], "CONTEXT_SCORE": ["-1", "0.9"], "CUI_SCORE": ["0.8118"], "DISPARAGED": ["no"], "FREQUENCY": ["-1", "0.928871617"]}, "cuis": ["C0163712"], "names": {"PT": {"Y": ["related"]}, "SY": {"Y": ["relate", "relates"]}}, "tuis": ["T109"]}}
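Note that the key in the raw record is a stringified Python tuple rather than a CURIE. A minimal sketch of unpacking it is below, assuming each line of umls.jsonl holds exactly one such key (which is what the record above looks like, but I have not verified it for the whole file). `parse_umls_record` is a hypothetical name.

```python
import ast
import json
from typing import Tuple


def parse_umls_record(line: str) -> Tuple[Tuple[str, str], dict]:
    """Split one umls.jsonl-style line into ((source, code), payload)."""
    record = json.loads(line)
    # Assumption: exactly one key per line; unpacking raises ValueError otherwise.
    (raw_key, payload), = record.items()
    # The key looks like "('CHV', '0000018040')"; literal_eval parses it safely
    # without executing arbitrary code.
    source, code = ast.literal_eval(raw_key)
    return (source, code), payload
```

From the payload, `names["PT"]["Y"]` gives the preferred term (`related` here) and `tuis` gives the semantic type codes, which is enough to trace where the odd name and the T109 type came from.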
This raises a couple of issues. First, T109 should map to small molecule (per Biolink Model version 4.2.5, which was used in the build of KG2.10.2), yet the node is categorized as `biolink:ChemicalEntity`. This likely means that there is an issue with the validation script. Second, there seems to be an issue on the CHV/UMLS side, which assigns that TUI to a nonsense term.
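To see whether the first issue is more widespread, a check along the following lines might help. It is only a sketch: the mapping table contains the single T109 entry taken from the comment above (the `biolink:SmallMolecule` CURIE is my assumption for "small molecule"), and the real TUI-to-category mapping should come from the Biolink Model and the KG2 build configuration. It relies on the `description` field carrying the `STY:Txxx` code, as in the nodes shown earlier; `category_mismatch` is a hypothetical name.

```python
import re
from typing import Optional

# Assumption: illustrative mapping with one entry only; not the authoritative table.
EXPECTED_CATEGORY_BY_TUI = {"T109": "biolink:SmallMolecule"}

# Matches the semantic type code in e.g. "UMLS Semantic Type: STY:T109".
TUI_PATTERN = re.compile(r"STY:(T\d{3})")


def category_mismatch(node: dict) -> Optional[str]:
    """Return a message if the node's category disagrees with its TUI, else None."""
    match = TUI_PATTERN.search(node.get("description") or "")
    if match is None:
        return None  # no TUI recorded in the description
    tui = match.group(1)
    expected = EXPECTED_CATEGORY_BY_TUI.get(tui)
    if expected is None or node.get("category") == expected:
        return None  # unknown TUI or category already agrees
    return f"{node.get('id')}: {tui} expects {expected}, found {node.get('category')}"
```

Running this over every node in the dump and counting the non-`None` results would show whether `CHV:0000018040` is an isolated case or whether every T109 node was categorized the same way.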