Demo stretch goal: `ARAX_standardize`

Open dkoslicki opened this issue 5 years ago • 8 comments

Basically, if we receive a message from the ARS that uses CURIEs that aren't in our KGNodeIndex, then we don't know what they are. The goal of this would be to attempt to convert all specified CURIEs to things we know about:

  • [ ] Create a QueryRenciNodeNormalization.py similar to QueryCOHD.py but that uses this endpoint.
  • [ ] Iterate over all nodes in the incoming QG, and use QueryRenciNodeNormalization.py to see if we can "normalize" them to CURIEs we recognize.
  • [ ] Wrap this in a class like ARAX_normalize.py similar to ARAX_{overlay, filter_kg, expander}.py
  • [ ] Integrate into ARAX_query.py so we can use it in the DSL to first normalize an incoming message QG/KG.
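
The second checklist item could be sketched roughly as below. This is only an illustration under assumptions: the query graph is taken to be a dict with a `nodes` list whose entries may carry a `curie` field, `known_curies` stands in for a KGNodeIndex lookup, and `normalize` is a hypothetical callable standing in for whatever QueryRenciNodeNormalization.py would expose (the real service's request and response format is not modeled here).

```python
from typing import Callable, Dict, Iterable, List, Set


def normalize_query_graph(
    query_graph: Dict,
    known_curies: Set[str],
    normalize: Callable[[str], Iterable[str]],
) -> List[str]:
    """Rewrite unrecognized node CURIEs in place; return those left unresolved."""
    unresolved = []
    for node in query_graph.get("nodes", []):
        curie = node.get("curie")
        if curie is None or curie in known_curies:
            continue  # node has no CURIE, or we already recognize it
        # Ask the (hypothetical) normalizer for equivalent identifiers and
        # keep the first one we recognize.
        match = next((c for c in normalize(curie) if c in known_curies), None)
        if match is None:
            unresolved.append(curie)
        else:
            node["original_curie"] = curie  # keep provenance of the rewrite
            node["curie"] = match
    return unresolved


def fake_normalize(curie: str) -> List[str]:
    """Toy stand-in for the web-service wrapper (assumption, not a real API)."""
    table = {"SNOMEDCT:41345002": ["HP:0002748", "DOID:10609"]}
    return table.get(curie, [])


qg = {
    "nodes": [
        {"id": "n0", "curie": "SNOMEDCT:41345002"},
        {"id": "n1", "curie": "FOO:1"},
        {"id": "n2", "type": "protein"},
    ]
}
unresolved = normalize_query_graph(qg, {"DOID:10609"}, fake_normalize)
```

Passing the normalizer in as a callable keeps the loop independent of where the synonyms come from, so a local index and a remote service could be swapped without touching it.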

That way, we don't throw an error just because someone is using, say, UMLS CURIEs (not in KG1) to specify something equivalent to an HP CURIE (in KG1).

dkoslicki avatar Feb 23 '20 03:02 dkoslicki

I like it. I think we can totally do this by demo time.

edeutsch avatar Feb 23 '20 04:02 edeutsch

Just a bump, @edeutsch, to see if this is still on your radar, i.e. whether we should change this to "low priority" from "future improvements" or something like that.

dkoslicki avatar Mar 06 '20 01:03 dkoslicki

Yeah, I've been musing about it a bit. I was also wondering if we can use our new, much expanded KGNodeIndex to do more or less the same thing. KGNodeIndex now has tons of different identifiers for the same thing. Not generally good, in my view, but perhaps a benefit here. My concern is that KGNodeIndex now knows comprehensively what is in KG2 but does not know what is in KG1. I'm kind of uncomfortable with this; I think it's going to cause a lot of problems, as I've said. I've been musing about having both a KG1 index and a KG2 index. In principle, then, every node in a QueryGraph needs to be in KG1 or the query will likely fail. We could potentially use the KG2 index as a translation service to KG1 nodes, which would be 1000x faster than calling out to a web service. We could of course fall back on the RENCI system if this failed.

edeutsch avatar Mar 06 '20 02:03 edeutsch

Just a note, currently, the KGNodeIndex has both the KG1 and KG2 nodes in it: I append the KG2 nodes to KG1 nodes here.

Are the problems you raise mainly when querying with names (rather than CURIEs)? Just trying to understand your concerns.

But +1 to using KG2 index first to synonymize before needing to reach out to RENCI to do the same thing (or if we can’t).

dkoslicki avatar Mar 06 '20 06:03 dkoslicki

The issue for translation is this: if a query comes in with SNOMEDCT:41345002, I can trivially learn all the synonyms for rickets: rickets = ['HP:0002748', 'DOID:10609', 'CUI:C0035579', 'CHV:0000010891', 'HPO:HP%3A0002748', 'MEDDRA:10039119', 'MEDCIN:33625', 'MEDLINEPLUS:3898', 'MESH:D012279', 'NCIT:C26878', 'NCI_NCI-GLOSS:CDR0000655123', 'NCI_NICHD:C26878', 'OMIM:MTHU006645', 'SNOMEDCT:41345002', 'EFO:0005583']

But how will I know which of these is in KG1? KGNodeIndex does not know to pick DOID:10609 because its data store is the union of KG1 and KG2 and it doesn't know what's in KG1.
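
With a separate KG1 index, the selection step becomes a simple membership filter over the synonym list. A minimal sketch, assuming a hypothetical `kg1_curies` set (its contents here are illustrative, not the actual KG1 node list) and a hypothetical `pick_kg1_curie` helper:

```python
from typing import Iterable, Optional, Set


def pick_kg1_curie(synonyms: Iterable[str], kg1_curies: Set[str]) -> Optional[str]:
    """Return the first synonym that is present in KG1, or None if none is."""
    for curie in synonyms:
        if curie in kg1_curies:
            return curie
    return None


rickets = ['HP:0002748', 'DOID:10609', 'CUI:C0035579', 'CHV:0000010891',
           'HPO:HP%3A0002748', 'MEDDRA:10039119', 'MEDCIN:33625',
           'MEDLINEPLUS:3898', 'MESH:D012279', 'NCIT:C26878',
           'NCI_NCI-GLOSS:CDR0000655123', 'NCI_NICHD:C26878',
           'OMIM:MTHU006645', 'SNOMEDCT:41345002', 'EFO:0005583']

# Hypothetical KG1-only index; a real one would be built from the KG1 node list.
kg1_curies = {'DOID:10609'}

chosen = pick_kg1_curie(rickets, kg1_curies)
```

If several synonyms were in KG1, this would return whichever comes first in the list, so a real version would probably want an explicit preference order.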

edeutsch avatar Mar 06 '20 06:03 edeutsch

Ah, I understand now. And certainly see how this is an issue.

Maybe expand should be able to dynamically figure out which KPs can identify which nodes/edges?

This vaguely reminds me of my feeble efforts to start enumerating which KPs know about what (and examples of this in code as well). Seems like working with the standards team (SRI?/SIR?/forget the acronym) post demo will be helpful for this.

“Who knows about what” will definitely become an issue with this distributed plan NCATS has in mind.

dkoslicki avatar Mar 06 '20 06:03 dkoslicki

I have implemented this new method in ARAXMessenger in demo:

result = messenger.reassign_curies(message, { 'knowledge_provider': 'KG1', 'mismap_result': 'WARNING'} )

It will loop through the QueryGraph in a message (not the KnowledgeGraph, although maybe this is an interesting add-on) and will attempt to remap CURIEs to either KG1 or KG2 if they are not present in that resource. That is, if a CURIE is in the target KP, then fine. If it is not in the target KP (usually KG1), then try one's best to remap it to that. If it cannot be remapped, throw a WARNING (generally allows continuation) or an ERROR (generally halts execution).
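
The WARNING versus ERROR behavior can be illustrated with a rough sketch. This is not the ARAXMessenger code: `remap_curies`, `RemapResult`, and the `target_curies` and `synonyms` arguments are hypothetical names, and the query graph shape (a dict with a `nodes` list) is assumed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RemapResult:
    status: str = "OK"  # "OK", "WARNING", or "ERROR"
    messages: List[str] = field(default_factory=list)


def remap_curies(
    query_graph: Dict,
    target_curies: Set[str],
    synonyms: Dict[str, List[str]],
    mismap_result: str = "WARNING",
) -> RemapResult:
    """Remap node CURIEs to ones in the target KP, in place (hypothetical helper)."""
    if mismap_result not in ("WARNING", "ERROR"):
        raise ValueError("mismap_result must be 'WARNING' or 'ERROR'")
    result = RemapResult()
    for node in query_graph["nodes"]:
        curie = node.get("curie")
        if curie is None or curie in target_curies:
            continue  # already usable by the target KP
        candidates = [c for c in synonyms.get(curie, []) if c in target_curies]
        if candidates:
            node["curie"] = candidates[0]
            continue
        result.messages.append(f"{mismap_result}: could not remap {curie}")
        result.status = mismap_result
        if mismap_result == "ERROR":
            return result  # halt at the first CURIE that cannot be remapped
    return result


synonyms = {"SNOMEDCT:41345002": ["HP:0002748", "DOID:10609"]}
qg = {"nodes": [{"curie": "SNOMEDCT:41345002"}, {"curie": "FOO:1"}]}
warned = remap_curies(qg, {"DOID:10609"}, synonyms, mismap_result="WARNING")

qg_strict = {"nodes": [{"curie": "FOO:1"}, {"curie": "SNOMEDCT:41345002"}]}
failed = remap_curies(qg_strict, {"DOID:10609"}, synonyms, mismap_result="ERROR")
```

In the WARNING case the loop keeps going and remaps what it can; in the ERROR case it stops at the first failure, so later nodes are left untouched.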

The question is where to deploy this. It could be an explicit command, although this might be tedious for the user to remember to do. Or, I'm thinking that maybe Expand() should call this before trying its expand operations. Hmm, although reflecting on this more, maybe we don't want to do this for a whole graph. Maybe just for certain nodes. Say the Expander will call different KPs for different edges; maybe it will want to remap just the involved nodes. Hmm. Anyway, it is currently implemented for the whole QueryGraph.

It is not automatically called anywhere yet.

edeutsch avatar Mar 11 '20 06:03 edeutsch

Can we close out this issue?

saramsey avatar Nov 30 '23 19:11 saramsey