ro-crate Finding Root Data Entity in RDF

Describe the bug

We currently have two sections on finding the root data entity. The first is JSON-LD specific and the second applies to SPARQL. However, it would be a good idea for these two to be equivalent, right?

The discrepancies are these:

The SPARQL will match any IRI containing the string ro-crate-metadata.json. This is probably designed to match ro-crate-metadata.jsonld as well as ro-crate-metadata.json. However it means it will also match subdirectories like some/dir/ro-crate-metadata.json, and also files with different names like ro-crate-metadata.json.bak
The SPARQL algorithm requires the root data entity to be a Dataset, whereas the JSON-LD algorithm does not.
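
The first discrepancy can be illustrated in plain Python. This is only a sketch: the candidate identifiers are the examples from the list above, and the two helper functions are hypothetical stand-ins for a substring test and an exact-name test, not the queries from the spec.

```python
# Hypothetical stand-ins for the two matching strategies; neither is the spec's query.
NAMES = {"ro-crate-metadata.json", "ro-crate-metadata.jsonld"}


def substring_match(identifier: str) -> bool:
    """Mimic a 'contains ro-crate-metadata.json' test."""
    return "ro-crate-metadata.json" in identifier


def exact_match(identifier: str) -> bool:
    """Accept only the two expected file names."""
    return identifier in NAMES


candidates = [
    "ro-crate-metadata.json",
    "ro-crate-metadata.jsonld",
    "some/dir/ro-crate-metadata.json",
    "ro-crate-metadata.json.bak",
]

for candidate in candidates:
    print(
        f"{candidate!r}: substring={substring_match(candidate)}, "
        f"exact={exact_match(candidate)}"
    )
```

The substring test accepts all four identifiers, while the exact test accepts only the first two.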

URL

https://www.researchobject.org/ro-crate/specification/1.2-DRAFT/root-data-entity.html#finding-the-root-data-entity
https://www.researchobject.org/ro-crate/specification/1.2-DRAFT/appendix/relative-uris#finding-ro-crate-root-in-rdf-triple-stores

Suggested fix

For the SPARQL:

PREFIX schema:  <http://schema.org/>
SELECT ?crate ?metadatafile
WHERE {
  ?metadatafile schema:about ?crate .
  FILTER(STR(?metadatafile) IN ("ro-crate-metadata.json", "ro-crate-metadata.jsonld"))
}

And then we just say that the crate is not conformant if ?crate ends up without an @type of Dataset.

Mar 05 '25 06:03 multimeric

Your suggested fix makes sense to me 👍

Mar 05 '25 11:03 elichad

Actually, it might be possible to avoid the string conversion, and just simplify it to:

FILTER(?metadatafile IN (<ro-crate-metadata.json>, <ro-crate-metadata.jsonld>))

Mar 05 '25 23:03 multimeric

One problem with this is that it fails (perhaps counterintuitively) in many practical cases where an RDF library / store is used. For instance, in Python:

import rdflib
g = rdflib.Graph()
g.parse("crate/ro-crate-metadata.json")

q = """\
PREFIX schema: <http://schema.org/>

SELECT ?crate ?metadatafile
WHERE {
  ?metadatafile schema:about ?crate .
  FILTER(?metadatafile IN (<ro-crate-metadata.json>, <ro-crate-metadata.jsonld>))
}
"""

res = g.query(q)
for r in res:
    print(r)

Nothing is printed, because what's in the graph is actually something like:

<file:///local/path/to/crate/ro-crate-metadata.json>

The same goes for an RDF triple store where IDs have been absolutized with arcp for merging multiple crates (e.g. WorkflowHub Knowledge Graph), where you have terms like:

<arcp://uuid,5ea8dbea-ab9e-5507-cd0d-f174a7f8c012/ro-crate-metadata.json>

Since the spec section has the title "Finding RO-Crate Root in RDF triple stores", the current query would succeed while the one suggested here would fail.
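
The mismatch can be reproduced without an RDF library. The sketch below assumes that the parser resolves relative @id values against the document location using standard relative-reference resolution, and uses urllib.parse.urljoin as a stand-in for that step. The helper name absolutize is hypothetical.

```python
from urllib.parse import urljoin


def absolutize(relative_id: str, document_iri: str) -> str:
    """Resolve a relative @id against the IRI the document was loaded from."""
    return urljoin(document_iri, relative_id)


# Assumed document location, taken from the example path above.
document_iri = "file:///local/path/to/crate/ro-crate-metadata.json"

metadata_in_graph = absolutize("ro-crate-metadata.json", document_iri)
root_in_graph = absolutize("./", document_iri)

print(metadata_in_graph)
print(root_in_graph)
# A query term that stays relative can never equal the absolute term in the graph.
print(metadata_in_graph == "ro-crate-metadata.json")
```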

Mar 13 '25 11:03 simleo

Good point, I found it confusing that the spec never discusses the absolute URL of the crate. As you say, rdflib seems to assume that it can absolutize it using the absolute path to the crate, but that seems to contradict how the spec wants it to be treated.

Anyway, how about we ask the user to provide a crate base IRI, e.g.

PREFIX schema: <http://schema.org/>

SELECT ?crate ?metadatafile
WHERE {
  ?metadatafile schema:about ?crate .
  FILTER(STRAFTER(STR(?metadatafile), STR(?crate_base)) IN ("ro-crate-metadata.json", "ro-crate-metadata.jsonld"))
}

Therefore the user could input crate_base: <file:///local/path/to/crate/> or crate_base: <arcp://uuid,5ea8dbea-ab9e-5507-cd0d-f174a7f8c012/> as a SPARQL binding, and it would work equally well. Note that the base needs its trailing slash for the remainder to match the bare file name.

Mar 14 '25 05:03 multimeric

The SPARQL algorithm in 1.2-DRAFT has not been updated to match the simplified JSON algorithm in 1.2-DRAFT, where we have now hard-coded the name and assume "@id": "ro-crate-metadata.json". However, I think it is still valid to set a @base in an RO-Crate JSON-LD, in which case the crate_base trick would only work when done as described in https://www.researchobject.org/ro-crate/specification/1.2-DRAFT/appendix/relative-uris.html#parsing-as-rdf-with-a-different-ro-crate-root

In a sense then you don't need any FILTER and can just use BIND to set metadatafile to <file:///local/path/to/crate/ro-crate-metadata.json> -- it won't be ro-crate-metadata.jsonld in RO-Crate 1.2.
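
For comparison, the JSON side with a hard-coded name can be sketched in a few lines of standard-library Python. This is my reading of the simplified algorithm, not a copy of the spec text: it assumes a flattened @graph, looks up the entity whose @id is ro-crate-metadata.json, and follows its about reference. The function name and the sample document are illustrative only.

```python
import json


def find_root_data_entity(metadata: dict) -> dict:
    """Follow 'about' from the hard-coded metadata descriptor to the root entity."""
    entities = {entity["@id"]: entity for entity in metadata["@graph"]}
    descriptor = entities["ro-crate-metadata.json"]
    return entities[descriptor["about"]["@id"]]


# Illustrative minimal document, not a complete or validated crate.
doc = json.loads("""
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork", "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "name": "Example crate"}
  ]
}
""")

root = find_root_data_entity(doc)
print(root["@id"], root["@type"])
```

Note that this lookup never checks that the result has @type Dataset, which is the second discrepancy raised at the top of this issue.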

Mar 27 '25 15:03 stain

@elichad can you check if this should block 1.2 release candidate 1?

Mar 27 '25 15:03 stain

It sounds to me like this one query is trying to serve multiple purposes (the text seems ambiguous about this too):

finding the RDE within a graph representing one specific RO-Crate (i.e. a direct loading of ro-crate-metadata.json only) (here you don't want to find nested crates)
finding all RDEs of all RO-Crates in a graph of any size (here you do)

It sounds like we need a different query for each scenario. We could add some text to explain the limitations of the current query - as I understand it, what we have at the moment is more suited to scenario 2, but will work in scenario 1 as well if there are no nested crates (or strangely named files, which should be rare!).
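
To make the two scenarios concrete, here is a rough sketch over plain (subject, predicate, object) string tuples. Both function names are hypothetical. Scenario 1 assumes the crate base IRI is known; scenario 2 uses a last-path-segment test, which is an assumption of this sketch and stricter than the substring match in the current query.

```python
ABOUT = "http://schema.org/about"
NAMES = ("ro-crate-metadata.json", "ro-crate-metadata.jsonld")


def roots_for_crate(triples, crate_base):
    """Scenario 1: the root of one specific crate, given its base IRI."""
    wanted = {crate_base + name for name in NAMES}
    return [o for s, p, o in triples if p == ABOUT and s in wanted]


def roots_in_store(triples):
    """Scenario 2: every root in the store, matched on the last path segment."""
    return [o for s, p, o in triples if p == ABOUT and s.rsplit("/", 1)[-1] in NAMES]


file_base = "file:///local/path/to/crate/"
arcp_base = "arcp://uuid,5ea8dbea-ab9e-5507-cd0d-f174a7f8c012/"
triples = [
    (file_base + "ro-crate-metadata.json", ABOUT, file_base),
    # A nested crate, as it would appear if it had been loaded into the same store.
    (file_base + "some/dir/ro-crate-metadata.json", ABOUT, file_base + "some/dir/"),
    (arcp_base + "ro-crate-metadata.json", ABOUT, arcp_base),
]

print(roots_for_crate(triples, file_base))
print(roots_in_store(triples))
```

Scenario 1 returns only the crate at file_base, while scenario 2 returns all three roots, including the nested one.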

And if we can agree here on a query to include for scenario 1, we can add that to the appendix as well (I defer to others on this as I don't do this RDF/SPARQL side of things very often).

Edit: Re-reading the section "Referencing another metadata document", it looks like we do try to avoid including the ro-crate-metadata.json of any nested/referenced crates as an entity, and if we do include it, it shouldn't have about. Though if inference is applied to the graph, it will still show up, as we define the inverse relation subjectOf on the referenced crate's entity instead.

Mar 27 '25 16:03 elichad

If you're happy with this approach then I don't think it blocks the release candidate - it is a minor problem. I'll make a PR for the first part of my suggestion.

Mar 27 '25 16:03 elichad