neosemantics Node label Inference

Open LorenzBuehmann opened this issue 6 years ago • 1 comments

Hi, I'm trying to use the inference engine to get all nodes labeled directly or indirectly with a class via the procedure semantics.inference.nodesLabelled.

It looks like it doesn't work in all cases...

I dug into the code to see what's done under the hood. As far as I can see, you do the following:

1. get all subclasses D of the given class C
2. for each c_i in D, get all nodes n_i labeled with it
3. return the UNION of all n_i

So far so good, makes sense.
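
To make sure I read the three steps correctly, here is a minimal in-memory sketch of them in plain Python. It is not the plugin's code: the class names, node ids, and the helper names `subclasses` and `nodes_labelled` are made up for illustration, and the target class is included in the result in the same way the lookup query appends 'SOME_CLASS_URI'.

```python
# Toy model only; all names and data below are hypothetical.
from collections import defaultdict

# child class -> direct superclasses (stands in for the SCO relationships)
SCO = {"Dog": ["Mammal"], "Cat": ["Mammal"], "Mammal": ["Animal"]}
# node id -> labels on that node
NODE_LABELS = {
    "rex": ["Dog"],
    "tom": ["Cat"],
    "nemo": ["Fish"],
    "generic": ["Animal"],
}


def subclasses(cls):
    """Step 1: all direct and indirect subclasses of cls."""
    children = defaultdict(list)
    for child, parents in SCO.items():
        for parent in parents:
            children[parent].append(child)
    found, stack = set(), [cls]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def nodes_labelled(cls):
    """Steps 2 and 3: union of the nodes labelled with cls or a subclass."""
    wanted = subclasses(cls) | {cls}
    return {n for n, labels in NODE_LABELS.items() if wanted & set(labels)}


print(sorted(nodes_labelled("Mammal")))  # ['rex', 'tom']
print(sorted(nodes_labelled("Animal")))  # ['generic', 'rex', 'tom']
```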

Just to recap, for some SOME_CLASS_URI and params { catLabel: "Class", subCatRel: "SCO", catNameProp: "uri" }, the lookup in step 1 is basically

MATCH path = (c:`Class`)<-[:`SCO`*]-(s:`Class`)
WHERE s.`uri` IN labels
AND NOT (c)-[:`SCO`]->()
AND any(x IN nodes(path) WHERE x.`uri` = 'SOME_CLASS_URI')
RETURN COLLECT(DISTINCT s.`uri`) + 'SOME_CLASS_URI' AS l

But there is a corner case when the OWL ontology contains explicit triples that connect a class C to owl:Thing, i.e. (C rdfs:subClassOf owl:Thing), which is semantically redundant but can easily happen. In that case, owl:Thing is never labeled with Class, which means that the line

AND NOT (c)-[:`SCO`]->()

will exclude all paths ending in class C, because C has such an edge to owl:Thing. This means you won't get the subclasses of C, and therefore you won't get the inferred individuals of class C either.

A quick fix would be to use the Class label, i.e.

AND NOT (c)-[:`SCO`]->(:Class)

in that example and in the code it would be

AND NOT (c)-[:`%3$s`]->(:`%1$s`)
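
To illustrate the corner case and the effect of the label-restricted check, here is a small Python model of the lookup. It is only a sketch under my own assumptions: the data, `IS_CLASS`, `IN_USE`, and the helper names are hypothetical, and the path matching is simplified to what the query does (start at an instantiated class, follow SCO edges upward, require a Class-labelled root as the end node and the target somewhere on the path).

```python
# Toy model only; all names and data below are hypothetical.
# uri -> does the node carry the Class label? (owl:Thing does not)
IS_CLASS = {"Animal": True, "Mammal": True, "Dog": True, "owl:Thing": False}
# (subclass, superclass) pairs, including the redundant edge to owl:Thing
SCO = [("Dog", "Mammal"), ("Mammal", "Animal"), ("Animal", "owl:Thing")]
# class uris that are actually used as node labels
IN_USE = {"Dog"}


def upward_paths(start):
    """Yield every path of length >= 1 that follows SCO edges upward."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        for child, parent in SCO:
            if child == path[-1]:
                longer = path + [parent]
                yield longer
                stack.append(longer)


def is_root_original(c):
    """NOT (c)-[:SCO]->(): c has no outgoing SCO edge at all."""
    return not any(child == c for child, _ in SCO)


def is_root_fixed(c):
    """NOT (c)-[:SCO]->(:Class): ignore edges to unlabelled nodes."""
    return not any(
        child == c and IS_CLASS.get(parent, False) for child, parent in SCO
    )


def lookup(target, is_root):
    """Instantiated classes below target, plus target itself."""
    found = set()
    for s in IN_USE:
        if not IS_CLASS.get(s, False):
            continue
        for path in upward_paths(s):
            c = path[-1]
            if IS_CLASS.get(c, False) and is_root(c) and target in path:
                found.add(s)
    return found | {target}


print(sorted(lookup("Animal", is_root_original)))  # ['Animal']
print(sorted(lookup("Animal", is_root_fixed)))     # ['Animal', 'Dog']
```

With the original check no class qualifies as a root, so only the target itself comes back; with the label-restricted check Animal counts as a root again and Dog is found.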

That said, is there a particular reason for not just doing

MATCH (c:`Class`)<-[:`SCO`*]-(s:`Class`) 
WHERE c.`uri`='SOME_CLASS_URI'
RETURN s

to get all subclasses of a class? I guess I'm missing something? Performance maybe?

Cheers and thanks for the plugin.

Sep 04 '19 06:09 LorenzBuehmann

Thanks for the detailed explanation, @LorenzBuehmann! Your proposed solution makes sense, and I'll include it in the next release.

The reason for not going with the simplified version of the query is indeed efficiency/performance. In a large dataset like the one described in this section of the manual (with 1.8 million classes and 3.6 million subClassOf relationships), you would be exploring large portions of the hierarchy unnecessarily. And since in Neo4j we can easily find which classes are instantiated, we can reduce the search space to those, significantly improving efficiency.
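
As a rough illustration of the search-space argument, the following Python sketch counts how many classes each strategy touches on a made-up hierarchy. It is a toy count under stated assumptions, not a benchmark of Neo4j or of the plugin: the hierarchy shape, the two instantiated classes, and the function names `top_down` and `bottom_up` are all hypothetical.

```python
# Toy model only; all names and data below are hypothetical.
# Hierarchy: Root <- Branch0..Branch99 <- 10 leaves per branch.
SUPER = {}  # class -> its single direct superclass
for b in range(100):
    SUPER[f"Branch{b}"] = "Root"
    for leaf in range(10):
        SUPER[f"Leaf{b}_{leaf}"] = f"Branch{b}"

# Only two classes are actually used as node labels.
IN_USE = {"Leaf3_1", "Leaf42_7"}


def top_down(target):
    """Simplified query: visit every direct or indirect subclass."""
    children = {}
    for child, parent in SUPER.items():
        children.setdefault(parent, []).append(child)
    visited, stack = set(), [target]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in visited:
                visited.add(child)
                stack.append(child)
    return visited


def bottom_up(target):
    """Start from instantiated classes and walk up towards the root."""
    visited, hits = set(), set()
    for s in IN_USE:
        node, chain = s, [s]
        while node in SUPER:
            node = SUPER[node]
            chain.append(node)
        visited.update(chain)
        if target in chain:
            hits.add(s)
    return visited, hits


visited, hits = bottom_up("Root")
print(len(top_down("Root")))  # 1100 classes touched
print(len(visited))           # 5 classes touched
print(sorted(hits))           # ['Leaf3_1', 'Leaf42_7']
```

In this toy setup both strategies lead to the same instantiated classes, but the bottom-up walk only touches the ancestors of classes that are actually in use.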

Sep 04 '19 09:09 jbarrasa
