neosemantics
Node label Inference
Hi,
I'm trying to use the inference engine to get all nodes labeled directly or indirectly with a class, via the procedure semantics.inference.nodesLabelled.
It looks like it doesn't work in all cases.
I dug into the code to see what's done under the hood. As far as I can see, you do the following:
- get all subclasses D of the given class C
- for each c_i in D get all nodes n_i labeled with it
- return the UNION of all n_i
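The three steps above can be sketched in plain Python over a toy in-memory graph. This is only an illustration of the described logic, not the plugin's code: the data and the names SCO, NODE_LABELS, subclasses_of and nodes_labelled are made up for the example.

```python
from collections import deque

# Hypothetical toy data: subclass edges (child -> parents) and node labels.
SCO = {
    "Dog": ["Mammal"],
    "Cat": ["Mammal"],
    "Mammal": ["Animal"],
}
NODE_LABELS = {
    "rex": {"Dog"},
    "tom": {"Cat"},
    "dumbo": {"Mammal"},
    "tweety": {"Bird"},
}

def subclasses_of(cls, sco):
    """Step 1: all direct and indirect subclasses of cls, plus cls itself."""
    children = {}
    for child, parents in sco.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    seen = {cls}
    queue = deque([cls])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def nodes_labelled(cls, sco, node_labels):
    """Steps 2 and 3: union of the nodes carrying any label found in step 1."""
    wanted = subclasses_of(cls, sco)
    return {node for node, labels in node_labels.items() if labels & wanted}

print(sorted(nodes_labelled("Mammal", SCO, NODE_LABELS)))
```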
So far so good, makes sense.
Just to recap, for some SOME_CLASS_URI and params { catLabel: "Class", subCatRel: "SCO", catNameProp: "uri" }, the lookup in step 1 is basically
MATCH path = (c:`Class`)<-[:`SCO`*]-(s:`Class`)
WHERE s.`uri` in labels
AND NOT (c)-[:`SCO`]->()
AND any(x IN nodes(path) WHERE x.`uri` = 'SOME_CLASS_URI')
RETURN COLLECT(DISTINCT s.`uri`) + 'SOME_CLASS_URI' as l
But there is a corner case when your OWL ontology contains explicit triples that connect a class C to owl:Thing, i.e. (C rdfs:subClassOf owl:Thing), which is semantically redundant but can happen. In that case, owl:Thing is never labeled with Class, so C remains the topmost Class node on the path, which means that the line
AND NOT (c)-[:`SCO`]->()
will exclude all paths ending in class C, because C has such an edge to owl:Thing. This means you won't get the subclasses of C, and therefore you won't get the inferred individuals of class C either.
A quick fix would be to use the Class label, i.e.
AND NOT (c)-[:`SCO`]->(:Class)
in that example and in the code it would be
AND NOT (c)-[:`%3$s`]->(:`%1$s`)
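To make the difference between the two root filters concrete, here is a small Python simulation of the lookup on a toy ontology. It is a sketch under assumptions: the class names, the sets CLASS_NODES and LABELS_IN_USE, and the function names are hypothetical, and the path matching only approximates what the Cypher query does.

```python
# Hypothetical toy ontology: child -> parents via SCO.
SCO = {
    "Dog": ["Mammal"],
    "Mammal": ["Animal"],
    "Animal": ["owl:Thing"],  # explicit, semantically redundant triple
}
# Resources carrying the Class label; owl:Thing is deliberately absent.
CLASS_NODES = {"Dog", "Mammal", "Animal"}
# Labels actually present on instance nodes (the `labels` list in the query).
LABELS_IN_USE = {"Dog"}

def upward_paths(start, sco):
    """Yield every SCO path with at least one edge that starts at `start`."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        for parent in sco.get(path[-1], []):
            if parent in path:  # guard against cycles
                continue
            longer = path + [parent]
            yield longer
            stack.append(longer)

def inferred_labels(target, any_outgoing_edge_blocks):
    """Simulate the lookup; the flag selects which root filter is applied."""
    found = set()
    for s in CLASS_NODES & LABELS_IN_USE:
        for path in upward_paths(s, SCO):
            c = path[-1]
            if c not in CLASS_NODES:
                continue  # the path must end in a Class node
            parents = SCO.get(c, [])
            if any_outgoing_edge_blocks:
                is_root = not parents  # NOT (c)-[:SCO]->()
            else:
                # NOT (c)-[:SCO]->(:Class)
                is_root = not any(p in CLASS_NODES for p in parents)
            if is_root and target in path:
                found.add(s)
    return found | {target}

print(sorted(inferred_labels("Mammal", True)))   # ['Mammal']
print(sorted(inferred_labels("Mammal", False)))  # ['Dog', 'Mammal']
```

With the original filter, Animal is rejected as a root because of its edge to owl:Thing, so Dog is never found. With the label-restricted filter, Animal counts as a root and Dog is returned.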
That said, is there a particular reason for not just doing
MATCH (c:`Class`)<-[:`SCO`*]-(s:`Class`)
WHERE c.`uri`='SOME_CLASS_URI'
RETURN s
to get all subclasses of a class? I guess I'm missing something? Performance maybe?
Cheers and thanks for the plugin.
Thanks for the detailed explanation, @LorenzBuehmann! Your proposed solution makes sense; I'll include it in the next release.
The reason for not going with the simplified version of the query is indeed efficiency/performance. In a large dataset (like the one described in this section of the manual, with 1.8 million classes and 3.6 million subClassOf relationships), you would be exploring large portions of the hierarchy unnecessarily. And since in Neo4j we can easily find which classes are instantiated, we can reduce the search space to those, significantly improving efficiency.
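As a rough illustration of that argument, the following Python sketch compares how many classes each strategy touches on a toy hierarchy. It is not a benchmark and not the plugin's implementation: the hierarchy, the set LABELS_IN_USE and both function names are assumptions made for the example.

```python
# Hypothetical hierarchy: 1000 leaf classes directly under "Root",
# of which only two are used as labels on instance nodes.
PARENTS = {f"Leaf{i}": ["Root"] for i in range(1000)}
LABELS_IN_USE = {"Leaf3", "Leaf7"}

def descendants_top_down(target, parents):
    """Simplified query: walk down from the target through all subclasses."""
    children = {}
    for child, ps in parents.items():
        for p in ps:
            children.setdefault(p, []).append(child)
    visited, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in visited:
            visited.add(node)
            stack.extend(children.get(node, []))
    return visited  # every class touched is also part of the result

def labels_bottom_up(target, parents, labels_in_use):
    """Start from instantiated classes only and walk up towards the target."""
    matched, visited = {target}, set()
    for label in labels_in_use:
        seen, stack = set(), [label]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, []))
        visited |= seen
        if target in seen:
            matched.add(label)
    return matched, visited

top = descendants_top_down("Root", PARENTS)
matched, touched = labels_bottom_up("Root", PARENTS, LABELS_IN_USE)
print(len(top), len(touched), sorted(matched))
```

Both strategies lead to the same instance nodes here, because only labels that are actually in use can match a node; the bottom-up walk simply never visits the uninstantiated leaves.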