RTX 'subclass_of' cycles in KG2c

Open amykglen opened this issue 3 years ago • 9 comments

while working on KP reasoning requirement 1) in #1268, I set out to build an index for Plover that recursively finds all nodes that are biolink:subclass_of a given node in KG2c (so that if someone asks for 'diabetes' in their query graph, the query also effectively considers 'type 2 diabetes', anything that is a subclass of 'type 2 diabetes', and so on).
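for concreteness, here is a minimal sketch of the kind of recursive lookup described above. it is not Plover's actual code: the function names and the (child, parent) edge-pair format are assumptions made for illustration. it tracks visited nodes so that the traversal terminates even if the 'subclass_of' edges contain cycles.

```python
from collections import defaultdict, deque


def build_child_map(subclass_edges):
    """Map each parent id to the set of its direct subclasses.

    subclass_edges: iterable of (child_id, parent_id) pairs, meaning
    "child subclass_of parent". This input format is an assumption.
    """
    children = defaultdict(set)
    for child, parent in subclass_edges:
        children[parent].add(child)
    return children


def get_descendants(node_id, children):
    """Return every node reachable by following subclass_of edges backwards.

    The 'seen' set guarantees termination when the graph has cycles.
    """
    seen = {node_id}
    queue = deque([node_id])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(node_id)  # a node is not reported as its own descendant
    return seen


# toy ids only; B, C and A form a subclass_of cycle
toy_edges = [("B", "A"), ("C", "B"), ("A", "C"), ("D", "B")]
toy_children = build_child_map(toy_edges)
toy_descendants = get_descendants("A", toy_children)
```

note that a visited set only keeps the traversal from looping forever. every node in a cycle still ends up as a descendant of every other node in that cycle, so the results are only as good as the edges.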

but it quickly became apparent that there are a lot of (directed) 'subclass_of' cycles in KG2c. a couple of examples from http://kg2c-5-2.rtx.ai:7474/browser/:

match p=(n)-[:`biolink:subclass_of` *3..4]->(n) return p limit 3

[screenshot of the query results, 2021-04-11]

and there seem to be many such cycles: apparently 26,000 with up to 3 edges in them, plus larger ones (those take forever to count, so I don't have a number, but I suspect it's large). for example, here's a 7-edge one involving Acetaminophen:

match p=(n {id:'CHEMBL.COMPOUND:CHEMBL112'})<-[:`biolink:subclass_of` *1..7]-(n) return p limit 1

(acetaminophen alone is apparently part of about 700 subclass_of cycles with up to 7 edges)
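counting cyclic paths is slow because the number of paths grows quickly with the allowed length. one alternative (a sketch, not something that has been run against KG2c) is to compute strongly connected components over the 'subclass_of' edges: any component with more than one node, or a node with a self-loop, contains at least one directed cycle, and finding them takes time linear in the number of nodes and edges. the function names and the (source, target) edge-pair format below are assumptions for illustration; the algorithm is Kosaraju's, written iteratively to avoid recursion limits.

```python
from collections import defaultdict


def strongly_connected_components(edges):
    """Kosaraju's algorithm over (source, target) pairs; returns a list of sets."""
    graph = defaultdict(list)
    reverse = defaultdict(list)
    nodes = set()
    for src, dst in edges:
        graph[src].append(dst)
        reverse[dst].append(src)
        nodes.update((src, dst))

    # pass 1: depth-first search on the graph, recording finish order
    visited = set()
    order = []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # all neighbors explored, so this node is finished
                order.append(node)
                stack.pop()

    # pass 2: search the reversed graph in decreasing finish order
    assigned = set()
    components = []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in reverse[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    component.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def cyclic_components(edges):
    """Keep only the components that contain at least one directed cycle."""
    edges = list(edges)
    self_loops = {src for src, dst in edges if src == dst}
    return [c for c in strongly_connected_components(edges)
            if len(c) > 1 or c & self_loops]


# toy ids only: A -> B -> C -> A is a cycle, E has a self-loop, D is acyclic
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("E", "E")]
toy_cyclic = sorted(sorted(c) for c in cyclic_components(toy_edges))
```

this reports which nodes are tangled up in cycles rather than how many distinct cyclic paths exist, which is probably the more useful thing to know when deciding which edges to investigate.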

this definitely makes my task much harder, though I'm sure I can find a way to work around it. but in the bigger picture, should we worry about reasoning using this data? it seems like a bit of the wild west. I'm guessing most of these cycles are the result of little bugs in KG2, its upstream sources, and/or the node synonymizer (I haven't dived in to investigate), and I imagine they'll be quite hard to totally eradicate.

maybe there's some cleaner way of getting 'subclass_of' info for this purpose (KP reasoning)? for example, should we only trust biolink:subclass_of edges from certain provided_bys? (e.g., maybe don't trust such edges from SEMMED?)
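a rough sketch of what filtering by provided_by could look like. everything about the edge shape here is an assumption for illustration: that edges are dicts with 'predicate' and 'provided_by' keys, that 'provided_by' may hold one source or a list of them, and the placeholder source names in the toy data. an edge is kept if at least one of its sources is not on the untrusted list.

```python
def filter_subclass_edges(edges, untrusted_sources):
    """Return subclass_of edges backed by at least one trusted source.

    edges: iterable of dicts with 'predicate' and 'provided_by' keys
    (an assumed shape, not necessarily how KG2c stores edges).
    untrusted_sources: set of source identifiers to distrust.
    """
    kept = []
    for edge in edges:
        if edge["predicate"] != "biolink:subclass_of":
            continue
        sources = edge.get("provided_by", [])
        if isinstance(sources, str):
            sources = [sources]  # normalize a single source to a list
        if any(source not in untrusted_sources for source in sources):
            kept.append(edge)
    return kept


# toy edges with placeholder ids and source names (all hypothetical)
toy_edges = [
    {"id": 1, "predicate": "biolink:subclass_of", "provided_by": ["source_x"]},
    {"id": 2, "predicate": "biolink:subclass_of", "provided_by": ["untrusted_y"]},
    {"id": 3, "predicate": "biolink:subclass_of",
     "provided_by": ["untrusted_y", "source_x"]},
    {"id": 4, "predicate": "biolink:related_to", "provided_by": ["source_x"]},
    {"id": 5, "predicate": "biolink:subclass_of", "provided_by": "untrusted_y"},
]
toy_kept_ids = [e["id"] for e in filter_subclass_edges(toy_edges, {"untrusted_y"})]
```

whether an edge with mixed sources should be kept (as here) or dropped is a judgment call, and a stricter rule might be needed if the trusted sources turn out to contain cycles too.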

Apr 12 '21 00:04 amykglen
