RTX
'subclass_of' cycles in KG2c
in working on KP reasoning requirement 1) in #1268, I went to build an index for Plover that recursively finds all nodes that are biolink:subclass_of a given node in KG2c. (so that if someone is looking for 'diabetes' in their query graph, the query will also effectively consider 'type 2 diabetes', as well as anything that might be a subclass of 'type 2 diabetes', and so on...)
but it quickly became apparent that there are a lot of (directed) 'subclass_of' cycles in KG2c. a couple examples from http://kg2c-5-2.rtx.ai:7474/browser/:
match p=(n)-[:`biolink:subclass_of` *3..4]->(n) return p limit 3
and there seem to be many such cycles - apparently 26,000 with up to 3 edges in them, but there are also larger cycles (they take forever to count, so I don't have a number, but I suspect it's large). for example, here's a 7-edge one involving Acetaminophen:
match p=(n {id:'CHEMBL.COMPOUND:CHEMBL112'})<-[:`biolink:subclass_of` *1..7]-(n) return p limit 1

(acetaminophen alone is apparently part of about 700 subclass_of cycles with up to 7 edges)
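since counting cycles by path enumeration blows up quickly, one alternative is to look at strongly connected components (SCCs) instead: any SCC with more than one node (or a node with a self-loop) contains at least one directed cycle. here's a minimal sketch of that idea, assuming the subclass_of edges have already been exported as plain (subject, object) pairs; the function names are made up for illustration and this isn't anything that exists in Plover or KG2:

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Iterative Tarjan SCC over (subject, object) edge pairs; returns a list of sets."""
    graph = defaultdict(list)
    nodes = set()
    for subj, obj in edges:
        graph[subj].append(obj)
        nodes.update((subj, obj))

    index = {}        # discovery order of each node
    lowlink = {}      # lowest discovery index reachable from each node
    stack = []        # Tarjan's node stack
    on_stack = set()
    components = []
    counter = 0

    for root in nodes:
        if root in index:
            continue
        index[root] = lowlink[root] = counter
        counter += 1
        stack.append(root)
        on_stack.add(root)
        work = [(root, iter(graph[root]))]  # explicit stack instead of recursion
        while work:
            node, successors = work[-1]
            descended = False
            for nxt in successors:
                if nxt not in index:
                    index[nxt] = lowlink[nxt] = counter
                    counter += 1
                    stack.append(nxt)
                    on_stack.add(nxt)
                    work.append((nxt, iter(graph[nxt])))
                    descended = True
                    break
                if nxt in on_stack:
                    lowlink[node] = min(lowlink[node], index[nxt])
            if descended:
                continue
            work.pop()
            if work:
                parent = work[-1][0]
                lowlink[parent] = min(lowlink[parent], lowlink[node])
            if lowlink[node] == index[node]:
                component = set()
                while True:
                    member = stack.pop()
                    on_stack.discard(member)
                    component.add(member)
                    if member == node:
                        break
                components.append(component)
    return components


def cyclic_components(edges):
    """Keep only SCCs that actually contain a cycle (size > 1, or a self-loop)."""
    edges = list(edges)
    self_loops = {subj for subj, obj in edges if subj == obj}
    return [c for c in strongly_connected_components(edges)
            if len(c) > 1 or c & self_loops]
```

this runs in time linear in the number of nodes plus edges, so it should give a rough picture of how many nodes are caught up in cycles without enumerating every individual cycle.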
this definitely makes my task much harder, though I'm sure I can find a way to work around it... but in the bigger picture, should we worry about reasoning using this data? it's seeming like a bit of the wild west... I'm guessing most of these are the result of little bugs in KG2/its upstream sources and/or the node synonymizer (haven't dived in to investigate)... and I imagine they'll be quite hard to totally eradicate.
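for the workaround part, a rough sketch of what a cycle-tolerant descendant lookup could look like: a breadth-first walk with a visited set, so it terminates even when the subclass_of edges loop back on themselves. this assumes the edges are available as (subclass, superclass) pairs; the function names and the node IDs in the usage example are hypothetical, not the actual Plover index code:

```python
from collections import defaultdict, deque


def build_child_map(subclass_edges):
    """Map each superclass to the set of its direct subclasses."""
    children = defaultdict(set)
    for subclass, superclass in subclass_edges:
        children[superclass].add(subclass)
    return children


def get_descendants(node_id, children):
    """Return every node reachable via subclass_of edges pointing at node_id.

    The visited set guarantees termination even if the edges contain cycles.
    """
    visited = {node_id}
    queue = deque([node_id])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in visited:
                visited.add(child)
                queue.append(child)
    visited.discard(node_id)  # a node is not reported as its own descendant
    return visited


# hypothetical IDs; the last edge closes a cycle on purpose
example_edges = [
    ("EX:type_2_diabetes", "EX:diabetes"),
    ("EX:some_subtype", "EX:type_2_diabetes"),
    ("EX:diabetes", "EX:some_subtype"),
]
child_map = build_child_map(example_edges)
descendants = get_descendants("EX:diabetes", child_map)
```

note this only makes the traversal safe, it doesn't make the answers right: every node in a cycle ends up as a descendant of every other node in that cycle, so the data quality question still stands.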
maybe there's some cleaner way of getting 'subclass_of' info for this purpose (KP reasoning)? for example, should we only trust biolink:subclass_of edges from certain provided_bys? (e.g., maybe don't trust such edges from SEMMED?)
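to make the provided_by idea concrete, here's a small sketch of filtering subclass_of edges by source before building any index. the edge shape (dicts with "subject", "object", "predicate", and a list-valued "provided_by") and the source identifiers are assumptions for illustration; the real KG2c property names and values may differ:

```python
# placeholder source identifiers; NOT the real provided_by values in KG2c
UNTRUSTED_SOURCES = {"example:semmed"}


def trusted_subclass_edges(edges, untrusted_sources=UNTRUSTED_SOURCES):
    """Yield (subject, object) pairs for subclass_of edges with a trusted source.

    An edge is kept if at least one of its provided_by values is not in the
    untrusted set, i.e. it is dropped only when every source is untrusted.
    """
    for edge in edges:
        if edge.get("predicate") != "biolink:subclass_of":
            continue
        sources = set(edge.get("provided_by", []))
        if sources and sources <= untrusted_sources:
            continue  # asserted only by untrusted sources
        yield edge["subject"], edge["object"]
```

whether to keep an edge that's backed by both a trusted and an untrusted source is a judgment call; this version keeps it, and also keeps edges with no provided_by at all.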