Finding Root Data Entity in RDF
Describe the bug
We currently have two sections on finding the root data entity. The first is JSON-LD specific and the second applies to SPARQL. However it would be a good idea for these two to be equivalent, right?
The discrepancies are these:
- The SPARQL will match any IRI containing the string ro-crate-metadata.json. This is probably designed to match ro-crate-metadata.jsonld as well as ro-crate-metadata.json. However, it means it will also match subdirectories like some/dir/ro-crate-metadata.json, and also files with different names like ro-crate-metadata.json.bak.
- The SPARQL algorithm requires the root data entity to be a Dataset, whereas the JSON-LD algorithm does not.
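To make the first discrepancy concrete, here is a minimal sketch in plain Python (no RDF library) contrasting a substring test with an exact-name test. The candidate list and function names are illustrative assumptions; only the four file names come from the description above.

```python
# Hypothetical candidate IRIs; only the file names come from the issue text.
CANDIDATES = [
    "ro-crate-metadata.json",
    "ro-crate-metadata.jsonld",
    "some/dir/ro-crate-metadata.json",
    "ro-crate-metadata.json.bak",
]

NEEDLE = "ro-crate-metadata.json"
EXACT_NAMES = {"ro-crate-metadata.json", "ro-crate-metadata.jsonld"}


def substring_matches(iris):
    """Mimic a 'contains' style filter applied to the IRI string."""
    return [iri for iri in iris if NEEDLE in iri]


def exact_matches(iris):
    """Mimic a filter that accepts only the two exact file names."""
    return [iri for iri in iris if iri in EXACT_NAMES]


print(substring_matches(CANDIDATES))  # includes the subdirectory and .bak entries
print(exact_matches(CANDIDATES))      # only the two intended names
```

The substring test accepts all four candidates, while the exact test accepts only the first two.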
URL
- https://www.researchobject.org/ro-crate/specification/1.2-DRAFT/root-data-entity.html#finding-the-root-data-entity
- https://www.researchobject.org/ro-crate/specification/1.2-DRAFT/appendix/relative-uris#finding-ro-crate-root-in-rdf-triple-stores
Suggested fix
For the SPARQL:
PREFIX schema: <http://schema.org/>
SELECT ?crate ?metadatafile
WHERE {
?metadatafile schema:about ?crate .
FILTER(STR(?metadatafile) IN ("ro-crate-metadata.json", "ro-crate-metadata.jsonld"))
}
And then we just say that it's not conformant if ?crate ends up missing the @type of Dataset.
Your suggested fix makes sense to me 👍
Actually, it might be possible to avoid the string conversion, and just simplify it to:
FILTER(?metadatafile IN (<ro-crate-metadata.json>, <ro-crate-metadata.jsonld>))
One problem with this is that it fails (perhaps counterintuitively) in many practical cases where an RDF library / store is used. For instance, in Python:
import rdflib
g = rdflib.Graph()
g.parse("crate/ro-crate-metadata.json")
q = """\
PREFIX schema: <http://schema.org/>
SELECT ?crate ?metadatafile
WHERE {
?metadatafile schema:about ?crate .
FILTER(?metadatafile IN (<ro-crate-metadata.json>, <ro-crate-metadata.jsonld>))
}
"""
res = g.query(q)
for r in res:
    print(r)
Nothing is printed, because what's in the graph is actually something like:
<file:///local/path/to/crate/ro-crate-metadata.json>
The same goes for an RDF triple store where IDs have been absolutized with arcp for merging multiple crates (e.g. WorkflowHub Knowledge Graph), where you have terms like:
<arcp://uuid,5ea8dbea-ab9e-5507-cd0d-f174a7f8c012/ro-crate-metadata.json>
Since the spec section has the title "Finding RO-Crate Root in RDF triple stores", the current query would succeed while the one suggested here would fail.
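A rough illustration of why the exact comparison fails once IDs are absolute. This sketch uses plain string handling rather than an RDF library, and the `absolutize` helper is a hypothetical simplification (it is not full RFC 3986 resolution); the two base IRIs are the ones quoted above.

```python
METADATA_NAME = "ro-crate-metadata.json"


def absolutize(base: str, relative: str) -> str:
    """Simplified resolution: only valid for a base ending in '/' and a plain file name."""
    if not base.endswith("/"):
        raise ValueError("this sketch expects a base IRI ending in '/'")
    return base + relative


BASES = [
    "file:///local/path/to/crate/",
    "arcp://uuid,5ea8dbea-ab9e-5507-cd0d-f174a7f8c012/",
]

for base in BASES:
    stored = absolutize(base, METADATA_NAME)
    # The term held in the graph is absolute, so it never equals the bare name,
    # although it still contains that name as its last path segment.
    print(stored, stored == METADATA_NAME, stored.endswith("/" + METADATA_NAME))
```

In both cases the equality test is false and the suffix test is true, which is why a string-containment query still finds the file while an exact relative IRI does not.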
Good point. I found it confusing that the spec never discusses the absolute URL of the crate. As you say, rdflib seems to assume that it can absolutize it using the absolute path to the crate, but that seems to contradict how the spec wants it to be treated.
Anyway, how about we ask the user to provide a crate base IRI, e.g.
PREFIX schema: <http://schema.org/>
SELECT ?crate ?metadatafile
WHERE {
?metadatafile schema:about ?crate .
FILTER(STRAFTER(STR(?metadatafile), STR(?crate_base)) IN ("ro-crate-metadata.json", "ro-crate-metadata.jsonld"))
}
The user could then supply crate_base: <file:///local/path/to/crate/> or crate_base: <arcp://uuid,5ea8dbea-ab9e-5507-cd0d-f174a7f8c012/> as a SPARQL binding, and it would work equally well in both cases.
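As a sanity check of that filter logic, here is a small Python approximation. The `strafter` helper imitates SPARQL's STRAFTER on plain strings (text after the first occurrence of the marker, or an empty string if it is absent); `is_metadata_file` and the test IRIs are hypothetical names for illustration, and the base is assumed to end in a slash.

```python
ALLOWED = {"ro-crate-metadata.json", "ro-crate-metadata.jsonld"}


def strafter(text: str, marker: str) -> str:
    """Approximate SPARQL STRAFTER for plain strings."""
    if marker == "":
        return text
    _, found, tail = text.partition(marker)
    return tail if found else ""


def is_metadata_file(iri: str, crate_base: str) -> bool:
    """True if the part of the IRI after the crate base is an allowed file name."""
    return strafter(iri, crate_base) in ALLOWED


BASE = "file:///local/path/to/crate/"
print(is_metadata_file(BASE + "ro-crate-metadata.json", BASE))           # root metadata file
print(is_metadata_file(BASE + "some/dir/ro-crate-metadata.json", BASE))  # nested file
print(is_metadata_file(BASE + "ro-crate-metadata.json.bak", BASE))       # backup file
print(is_metadata_file(BASE + "ro-crate-metadata.json", BASE.rstrip("/")))  # base without slash
```

Only the first call returns True. The last call shows that the base needs its trailing slash, since otherwise the remainder starts with "/" and matches neither name.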
The SPARQL algorithm in 1.2-DRAFT had not been updated to match the simplified JSON algorithm in 1.2-DRAFT, as we have now hard-coded the name and assume "@id": "ro-crate-metadata.json". However, I think it is still valid to set a @base in an RO-Crate JSON-LD, in which case the crate_base trick would only work if done as described in https://www.researchobject.org/ro-crate/specification/1.2-DRAFT/appendix/relative-uris.html#parsing-as-rdf-with-a-different-ro-crate-root
In a sense then you don't need any FILTER and can just use BIND to set metadatafile to <file:///local/path/to/crate/ro-crate-metadata.json> -- it won't be ro-crate-metadata.jsonld in RO-Crate 1.2.
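A minimal sketch of that idea, using an in-memory set of (subject, predicate, object) tuples as a stand-in for a real triple store. The `find_root` function and the sample triples are hypothetical; the sketch also reports whether the root carries the Dataset type, in line with the earlier suggestion to treat a missing type as a conformance problem rather than a matching criterion.

```python
SCHEMA_ABOUT = "http://schema.org/about"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHEMA_DATASET = "http://schema.org/Dataset"


def find_root(triples, crate_base):
    """Look up the root via the exact metadata IRI; no string filtering needed.

    Returns (root_iri, is_dataset), or None if there is not exactly one match.
    """
    metadata_iri = crate_base + "ro-crate-metadata.json"  # assumes base ends in '/'
    roots = [o for s, p, o in triples if s == metadata_iri and p == SCHEMA_ABOUT]
    if len(roots) != 1:
        return None
    root = roots[0]
    return root, (root, RDF_TYPE, SCHEMA_DATASET) in triples


BASE = "file:///local/path/to/crate/"
TRIPLES = {
    (BASE + "ro-crate-metadata.json", SCHEMA_ABOUT, BASE),
    (BASE, RDF_TYPE, SCHEMA_DATASET),
    # A nested metadata file, which an exact lookup ignores.
    (BASE + "some/dir/ro-crate-metadata.json", SCHEMA_ABOUT, BASE + "some/dir/"),
}

print(find_root(TRIPLES, BASE))
```

With rdflib, one way to get the same effect would presumably be to pass the absolute metadata IRI through the initBindings argument of Graph.query instead of writing it into the query text.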
@elichad can you check if this should block 1.2 release candidate 1?
It sounds to me like this one query is trying to serve multiple purposes (the text seems ambiguous about this too):
- finding the RDE within a graph representing one specific RO-Crate (i.e. a direct loading of ro-crate-metadata.json only) (here you don't want to find nested crates)
- finding all RDEs of all RO-Crates in a graph of any size (here you do)
It sounds like we need a different query for each scenario. We could add some text to explain the limitations of the current query - as I understand it, what we have at the moment is more suited to scenario 2, but will work in scenario 1 as well if there are no nested crates (or strangely named files, which should be rare!).
And if we can agree here on a query to include for scenario 1, we can add that to the appendix as well (I defer to others on this as I don't do this RDF/SPARQL side of things very often).
Edit: Re-reading the "Referencing another metadata document" section, it looks like we do try to avoid including the ro-crate-metadata.json of any nested/referenced crates as an entity, and if we do include it, it shouldn't have about. However, if inference is applied to the graph then it will still show up, as we define the inverse relation subjectOf on the referenced crate's entity instead.
If you're happy with this approach then I don't think it blocks the release candidate - it is a minor problem. I'll make a PR for the first part of my suggestion.