RTX-KG2

Node with the name `related`

Open dkoslicki opened this issue 7 months ago • 4 comments

There appears to be a node `CHV:0000018040` with the category `biolink:ChemicalEntity` and the name `related`. This seems to be a bug (and it throws off a tiny portion of our LLM NLP->TRAPI work in https://github.com/Translator-CATRAX/catrax-milestones/issues/36). It's easy enough for us to fix locally, but I thought you would want to know and investigate whether it's part of a larger problem.

dkoslicki avatar May 14 '25 15:05 dkoslicki

Hi @dkoslicki, thank you for the bug report. Where did this appear? In an ARAX result?

saramsey avatar Jun 03 '25 15:06 saramsey

It appeared in the KG2 JSON-lines dump, while I was building a RAG database to help with NLP->TRAPI. So not in ARAX, but rather in the KG2 dump itself.

dkoslicki avatar Jun 03 '25 15:06 dkoslicki

"Bad node" is a "Bad issue title". Updated

dkoslicki avatar Jun 03 '25 15:06 dkoslicki

Here is the node in KG2pre:

{"category": "biolink:ChemicalEntity", "category_label": "chemical_entity", "creation_date": null, "deprecated": false, "description": "UMLS Semantic Type: STY:T109", "full_name": "related", "has_biological_sequence": null, "id": "CHV:0000018040", "iri": "http://purl.bioontology.org/ontology/CHV/0000018040", "name": "related", "provided_by": ["infores:chv-umls"], "publications": [], "replaced_by": null, "synonym": ["relate", "relates"], "update_date": "2023"}

It comes from the UMLS import.

There's only one edge featuring this node and it's a cross reference:

{"agent_type": "manual_agent", "domain_range_exclusion": false, "id": "CHV:0000018040---UMLS:xref---None---None---None---UMLS:C0163712---umls_source:CHV", "knowledge_level": "knowledge_assertion", "negated": false, "object": "UMLS:C0163712", "predicate": "biolink:close_match", "predicate_label": "xref", "primary_knowledge_source": "infores:chv-umls", "publications": [], "publications_info": {}, "qualified_object_aspect": null, "qualified_object_direction": null, "qualified_predicate": null, "relation_label": "xref", "source_predicate": "UMLS:xref", "subject": "CHV:0000018040", "update_date": "2023"}
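A lookup like the one above can be sketched in a few lines of Python. This is only an illustration, assuming an edges file in JSON-lines format with `subject` and `object` keys as in the record above; the function name `edges_touching` and the trimmed sample data are hypothetical and not part of the KG2 build code.

```python
import io
import json


def edges_touching(lines, node_id):
    """Yield every edge dict whose subject or object equals node_id.

    `lines` is any iterable of JSON-lines strings (an open file works).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        edge = json.loads(line)
        if node_id in (edge.get("subject"), edge.get("object")):
            yield edge


# Trimmed stand-in for an edges file; a real dump would be opened with open().
sample_edges = io.StringIO(
    '{"subject": "CHV:0000018040", "object": "UMLS:C0163712", '
    '"predicate": "biolink:close_match"}\n'
    '{"subject": "EXAMPLE:1", "object": "EXAMPLE:2", '
    '"predicate": "biolink:related_to"}\n'
)

matches = list(edges_touching(sample_edges, "CHV:0000018040"))
for edge in matches:
    print(edge["subject"], edge["predicate"], edge["object"])
```

Streaming line by line keeps memory use flat, which matters for a dump the size of KG2.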

The entry for the cross referenced node is:

{"category": "biolink:ChemicalEntity", "category_label": "chemical_entity", "creation_date": null, "deprecated": false, "description": "UMLS Semantic Type: STY:T109", "full_name": "Relate - vinyl resin", "has_biological_sequence": null, "id": "UMLS:C0163712", "iri": "https://identifiers.org/umls:C0163712", "name": "Relate - vinyl resin", "provided_by": ["infores:umls"], "publications": [], "replaced_by": null, "synonym": ["related", "relate", "relates"], "update_date": "2023"}

Here is the "raw" (from umls.jsonl) data regarding this node:

{"('CHV', '0000018040')": {"attributes": {"COMBO_SCORE": ["0.880223872", "0.870335809", "0.8118"], "COMBO_SCORE_NO_TOP_WORDS": ["0.880223872", "0.870335809", "0.8118"], "CONTEXT_SCORE": ["-1", "0.9"], "CUI_SCORE": ["0.8118"], "DISPARAGED": ["no"], "FREQUENCY": ["-1", "0.928871617"]}, "cuis": ["C0163712"], "names": {"PT": {"Y": ["related"]}, "SY": {"Y": ["relate", "relates"]}}, "tuis": ["T109"]}}
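The raw record is keyed by a string that looks like a Python tuple. A minimal sketch of reading such a line follows, assuming every line of umls.jsonl holds exactly one record with that key format; the helper name `summarize_raw_record` is hypothetical, and the `PT` and `Y` keys are read exactly as they appear without interpreting what they mean.

```python
import ast
import json


def summarize_raw_record(raw_line):
    """Return the CURIE, names, CUIs, and TUIs from one raw line."""
    record = json.loads(raw_line)
    # Assumption: one key per line, formatted like "('CHV', '0000018040')".
    ((key, body),) = record.items()
    source, code = ast.literal_eval(key)
    return {
        "curie": f"{source}:{code}",
        "pt_names": body["names"].get("PT", {}).get("Y", []),
        "cuis": body["cuis"],
        "tuis": body["tuis"],
    }


# Trimmed copy of the raw record quoted above (score attributes omitted).
raw_line = (
    '{"(\'CHV\', \'0000018040\')": {"cuis": ["C0163712"], '
    '"names": {"PT": {"Y": ["related"]}, "SY": {"Y": ["relate", "relates"]}}, '
    '"tuis": ["T109"]}}'
)

summary = summarize_raw_record(raw_line)
print(summary)
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, so a malformed key raises an error rather than executing code.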

This raises two issues. First, T109 should map to small molecule (per Biolink Model version 4.2.5, which was used in the build of KG2.10.2), yet both nodes above are categorized as `biolink:ChemicalEntity`. This likely means that there is an issue with the validation script. Second, there seems to be an issue on the CHV/UMLS side, which assigns that TUI to a nonsense term.
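One way to look for other affected nodes is to compare each node's category against the TUI named in its description. The sketch below is not the existing validation script. It assumes the description format `UMLS Semantic Type: STY:Txxx` seen in the nodes above, and its one-entry mapping (T109 to `biolink:SmallMolecule`) is taken only from the statement in this comment; the names `EXPECTED_CATEGORY_BY_TUI` and `category_mismatches` are hypothetical.

```python
import json
import re

# Assumed mapping, based only on the claim in this thread; extend as needed.
EXPECTED_CATEGORY_BY_TUI = {"T109": "biolink:SmallMolecule"}

# Matches the TUI inside descriptions like "UMLS Semantic Type: STY:T109".
TUI_PATTERN = re.compile(r"STY:(T\d{3})")


def category_mismatches(lines):
    """Yield (node id, actual category, expected category) for mismatches."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        node = json.loads(line)
        found = TUI_PATTERN.search(node.get("description") or "")
        if found is None:
            continue  # no TUI recorded in the description
        expected = EXPECTED_CATEGORY_BY_TUI.get(found.group(1))
        if expected is not None and node.get("category") != expected:
            yield node["id"], node.get("category"), expected


# Trimmed copies of the two nodes quoted above.
sample_nodes = [
    '{"id": "CHV:0000018040", "category": "biolink:ChemicalEntity", '
    '"description": "UMLS Semantic Type: STY:T109"}',
    '{"id": "UMLS:C0163712", "category": "biolink:ChemicalEntity", '
    '"description": "UMLS Semantic Type: STY:T109"}',
]

mismatches = list(category_mismatches(sample_nodes))
for node_id, actual, expected in mismatches:
    print(f"{node_id}: {actual} (expected {expected})")
```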

ecwood avatar Jun 22 '25 22:06 ecwood