Blank node behavior when using SPARQL

Open hsolbrig opened this issue 5 years ago • 3 comments

At the moment, different ShEx implementations exhibit different behaviors when crossing BNodes in SPARQL.

PyShEx has three options:

  1. Throw an error when attempting to submit a SPARQL query with a BNode subject or object
  2. Assume that the SPARQL endpoint maintains persistent BNodes (which may cause a hang / timeout if not true)
  3. Take advantage of the GraphDB specific solution

ShEx.js implements only option 2.

I'm not sure about other implementations.

Do we want to specify a consistent behavior across interpreters? If so, what should that be?

hsolbrig, Nov 12 '20 09:11

I've always voiced the opinion that it should be illegal to use a blank node as a ShEx starting point: in RDF, there is no expectation that a blank node label used in a serialization will be maintained within a datastore. This is the use case skolem IDs were created for, although I'm not a great fan of those, either.

Better to use a query to identify a starting node, where the query would result in the desired node.

gkellogg, Nov 18 '20 17:11

I think that's a separate issue though; this is about how you practically re-visit a bnode you got in response to a previous query. This is an issue for remote faceted browsing, ShEx validation, and anyone else iteratively querying a SPARQL endpoint.

ericprud, Nov 20 '20 09:11

I'm currently adding both arrival path and disambiguation code in the ShEx.js SPARQL interface. This allows it to:

  1. remember how it got to any bnode
  2. distinguish all of the visited bnodes from each other.

Wikidata (augmented) example:

wd:Q313093 <P999> _:a .
_:a
  # works
  <P2860> _:a ; # apparently, a bare blank node stands for unknown value
  # advisors
  <P184> wd:Q123 , _:1e_____ , _:xe_____ , _:ye_____ , _:1cd__2g , _:1cd__2f , _:1cdef2g , _:1cdef2f .

# advisors (mostly bnodes to exercise disambiguator)
wd:Q123                                                         <P735> "a" , "b" .
_:1e_____ <P000> wd:Qe                                        ; <P735> "abc" .
_:xe_____ <P000> wd:Qe                                        ; <P735> "abc" .
_:ye_____ <P000> wd:Qe                                        ; <P735> "abc" .
_:1cd__2g <P000> wd:Qc , wd:Qd                 ; <P001> wd:Qg ; <P735> "abc" .
_:1cd__2f <P000> wd:Qc , wd:Qd                 ; <P001> wd:Qf ; <P735> "abc" .
_:1cdef2g <P000> wd:Qc , wd:Qd , wd:Qe , wd:Qf ; <P001> wd:Qg ; <P735> "abc" .
_:1cdef2f <P000> wd:Qc , wd:Qd , wd:Qe , wd:Qf ; <P001> wd:Qf ; <P735> "abc" .

The data structure (JSON liberalized to include RDF terms) that identifies e.g. _:1cd__2g is:

{ start: wd:Q313093, path: [
  {p:<P999>}, # no ambiguity
  {p:<P184>, unique: {
    <P000>: [wd:Qc, wd:Qd],
    <P001>: [wd:Qg]
  } }
] }

which allows you to select for _:1cd__2g ?p ?o like:

SELECT ?1 ?p ?o WHERE {
  wd:Q313093 <P999> ?0 . # no ambiguity
  ?0 <P184> ?1 .
  ?1 <P000> wd:Qc , wd:Qd . ?1 <P001> wd:Qg . # disambiguate
  FILTER NOT EXISTS { ?1 <P000> ?2 FILTER (?2 NOT IN (wd:Qc, wd:Qd)) }
  ?1 ?p ?o
}
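To make the mapping from arrival path to query concrete, here is a minimal Python sketch that turns a path structure like the one above into a SELECT query. It is not the ShEx.js code; the function name `bnode_query` and the plain-dict layout (`p`, `unique`) are assumptions made for illustration, and RDF terms are passed as already-serialized strings. Unlike the hand-written query, it emits a closed-world FILTER NOT EXISTS for every predicate listed under `unique`, which is an assumption about how strict the disambiguation should be.

```python
def bnode_query(start, path):
    """Build a SELECT query that re-finds the node at the end of `path`.

    `start` is a serialized IRI; `path` is a list of steps, each a dict with
    a predicate `p` and an optional `unique` map of predicate -> values.
    (Hypothetical layout mirroring the liberalized JSON in this thread.)
    """
    if not path:
        raise ValueError("path must contain at least one step")
    lines = []
    subject = start
    extra_var = len(path)  # first variable index not taken by a path step
    for i, step in enumerate(path):
        node = f"?{i}"
        lines.append(f"  {subject} {step['p']} {node} .")
        for pred, values in step.get("unique", {}).items():
            # Require every recorded value ...
            lines.append(f"  {node} {pred} {' , '.join(values)} .")
            # ... and rule out nodes that carry any additional value.
            other = f"?{extra_var}"
            extra_var += 1
            lines.append(
                f"  FILTER NOT EXISTS {{ {node} {pred} {other} "
                f"FILTER ({other} NOT IN ({', '.join(values)})) }}"
            )
        subject = node
    lines.append(f"  {subject} ?p ?o .")
    return f"SELECT {subject} ?p ?o WHERE {{\n" + "\n".join(lines) + "\n}"


example_path = [
    {"p": "<P999>"},  # no ambiguity
    {"p": "<P184>", "unique": {"<P000>": ["wd:Qc", "wd:Qd"],
                               "<P001>": ["wd:Qg"]}},
]
print(bnode_query("wd:Q313093", example_path))
```

Prefix declarations are left out, as in the query above, so the output would need a `PREFIX wd:` line and a base IRI before being sent to an endpoint.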

_:1e_____, _:xe_____, and _:ye_____ are provably interchangeable, so the data structure for the first needs to indicate that it's serving for all three:

{ start: wd:Q313093, path: [
  {p:<P999>}, # no ambiguity
  {p:<P184>, unique: {
      <P000>: [wd:Qe]
    }, proxies: [ _:xe_____ , _:ye_____ ]
  }
] }

and _:xe_____ and _:ye_____ simply execute the query for _:1e_____.
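One way to find such interchangeable nodes is to group bnodes by the set of outgoing property values recorded for them. The sketch below assumes a hypothetical `group_proxies` helper and a plain dict of serialized terms; it is not taken from ShEx.js, and it looks only at outgoing arcs with non-bnode values.

```python
from collections import defaultdict


def group_proxies(descriptions):
    """Group bnodes whose recorded outgoing property values are identical.

    `descriptions` maps a bnode label to {predicate: list of values}.
    Returns {representative: [proxies]}, where the representative is the
    first label seen in each group (dicts preserve insertion order).
    """
    groups = defaultdict(list)
    for label, props in descriptions.items():
        # Order-independent signature of everything known about the node.
        signature = frozenset((p, frozenset(vs)) for p, vs in props.items())
        groups[signature].append(label)
    return {labels[0]: labels[1:] for labels in groups.values()}


advisors = {
    "_:1e_____": {"<P000>": ["wd:Qe"], "<P735>": ['"abc"']},
    "_:xe_____": {"<P000>": ["wd:Qe"], "<P735>": ['"abc"']},
    "_:ye_____": {"<P000>": ["wd:Qe"], "<P735>": ['"abc"']},
    "_:1cd__2g": {"<P000>": ["wd:Qc", "wd:Qd"], "<P001>": ["wd:Qg"],
                  "<P735>": ['"abc"']},
    "_:1cd__2f": {"<P000>": ["wd:Qc", "wd:Qd"], "<P001>": ["wd:Qf"],
                  "<P735>": ['"abc"']},
}
print(group_proxies(advisors))
```

With the advisor data from the example, _:1e_____ becomes the representative for _:xe_____ and _:ye_____, while the other bnodes each form a group of one.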

I haven't tested for corefs, which would be another way to disambiguate AND might prove that 1e, xe and ye aren't all interchangeable, but we'd have to do those tests only if the schema included inverse arcs in the right places.

ericprud, Nov 20 '20 09:11