
Issue: revisiting self-containment of data entities

Open eocarragain opened this issue 6 years ago • 1 comments

Following discussions at Open Repositories, the current spec reads (with added emphasis):

At the basic level, an RO-Crate is a collection of files represented as a Schema.org Dataset, that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc. The RO-Crate Metadata File describes the RO-Crate, and MUST be stored in the RO-Crate Root. Self-containment is a core principle of RO-Crate, i.e. that all Dataset files and relevant metadata SHOULD, as far as possible, be contained by the RO-Crate, rather than referring to external resources. However the RO-Crate MAY also reference external resources which are stored or accessed separately, via URIs, e.g. because these cannot be included for practical or legal reasons.

I suggest we change this to:

An RO-Crate is a collection of files and folders represented as a schema.org Dataset, that together form a meaningful unit for the purposes of communication, citation, distribution, preservation, etc.

Self-containment is a core principle of RO-Crate, i.e. that all files and folders that make up the RO-Crate are contained in or under the RO-Crate Root. For this reason, all Data Entities described in the RO-Crate Metadata File using the hasPart property MUST be referenced with a relative path. Note that for some use cases, some RO-Crate files may be stored in external locations, with mechanisms provided to re-compose the RO-Crate when needed; however, from RO-Crate's perspective all Data Entities are local. For example, if using RO-Crate with the BagIt specification, the fetch.txt file can be used for this purpose.
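To make the proposed "relative path" rule concrete, here is a minimal sketch of how a tool might check it. It assumes the metadata is JSON-LD with a flat "@graph" and that the root Dataset has the "@id" "./"; the function names are hypothetical and not part of any RO-Crate library.

```python
from __future__ import annotations

from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_local_relative(ref: str) -> bool:
    """Return True if ref is a relative path that stays under the crate root."""
    parsed = urlparse(ref)
    if parsed.scheme or parsed.netloc:
        return False  # absolute URI, e.g. https://...
    if ref.startswith("/"):
        return False  # absolute filesystem path
    # Reject paths that climb out of the root via ".."
    return ".." not in PurePosixPath(ref).parts


def non_local_parts(metadata: dict) -> list[str]:
    """List hasPart references on the root Dataset that break the rule.

    Assumption: the root Dataset is the entity whose "@id" is "./".
    """
    root = next(e for e in metadata["@graph"] if e.get("@id") == "./")
    return [
        part["@id"]
        for part in root.get("hasPart", [])
        if not is_local_relative(part["@id"])
    ]
```

An empty list from non_local_parts would mean every part listed on the root Dataset is local in the sense proposed above.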

The RO-Crate Metadata File describes the RO-Crate, and MUST be stored in the RO-Crate Root. Self-description is a core principle of RO-Crate, i.e. that all relevant metadata SHOULD, as far as possible, be contained by the RO-Crate, rather than referring to external resources. The RO-Crate Metadata section below describes specific requirements for describing Data Entities and Contextual Entities, including which properties of externally referenced Contextual Entities should appear in the RO-Crate Metadata File.

I think it makes sense to pull apart files/folders (self-containment) from metadata (self-description).

For self-containment (files/folders), removing the option of referencing external files/folders makes RO-Crate a lot simpler to explain and work with. I think it is a good example of where we should be opinionated and constrain the scope of RO-Crate rather than leave things open. From a Research Object perspective, it makes RO-Crates more explicitly like RO-Bundles/bagit-ro (focused on packaging) rather than the general RO model (which can aggregate content from anywhere). Mechanisms like fetch.txt in BagIt get around some of the 'practical' reasons for referencing external resources, e.g. duplication of large files, and we can illustrate these in implementation guidance.

The issue of access control ("legal reasons") is trickier. One option would be to treat this kind of content as related to, but not a component part of, the RO-Crate. For example, if you want to refer to and describe external content that has some relevance to the RO-Crate, we could use something like pcdm:hasRelatedObject rather than schema:hasPart:

pcdm:hasRelatedObject - Links to a related Object that is not a component part, such as an object representing a donor agreement or policies that govern the resource.

At least this would draw a clear line between strict component parts and external content.
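As a rough illustration of that line, the sketch below moves any hasPart reference with a URI scheme over to pcdm:hasRelatedObject and keeps only relative paths as parts. The function name is hypothetical, and the "pcdm:" prefix is assumed to be declared in the crate's JSON-LD @context; nothing here is existing RO-Crate behaviour.

```python
from urllib.parse import urlparse


def split_parts(root: dict) -> dict:
    """Return a copy of the root Dataset with external references moved
    from hasPart to pcdm:hasRelatedObject (assumed prefix, see lead-in)."""
    local, external = [], []
    for ref in root.get("hasPart", []):
        # Anything with a URI scheme (http, https, ...) counts as external.
        if urlparse(ref["@id"]).scheme:
            external.append(ref)
        else:
            local.append(ref)

    result = dict(root)
    result["hasPart"] = local
    if external:
        result["pcdm:hasRelatedObject"] = external
    return result
```

The input dict is left unmodified, so a tool could show both the original and the rewritten root Dataset side by side.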

eocarragain avatar Aug 09 '19 00:08 eocarragain

OK, but I can see places where you might want to refer to external datasets AND have a local record that describes them, e.g. you have a set which is derived from a standard corpus. You might want to have a Contextual Entity Dataset record as part of the provenance. I think this would be fine if the "@id" is an http URI. (See #32)
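A small sketch of what that could look like: the external corpus is described in the graph under an http URI but is not listed in hasPart, so it counts as a Contextual Entity rather than a Data Entity. The example.org URI, the file name, and the use of schema.org's isBasedOn for the provenance link are assumptions for illustration only, as is the root "@id" of "./".

```python
def contextual_entities(graph: list) -> list:
    """Return entities that are neither the root Dataset nor one of its parts.

    Assumption: the root Dataset is the entity whose "@id" is "./".
    """
    root = next(e for e in graph if e.get("@id") == "./")
    part_ids = {part["@id"] for part in root.get("hasPart", [])}
    return [
        e for e in graph
        if e["@id"] != "./" and e["@id"] not in part_ids
    ]


# Hypothetical crate: a local file derived from an external standard corpus.
example_graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "derived.csv"}]},
    {
        "@id": "derived.csv",
        "@type": "File",
        "isBasedOn": {"@id": "https://example.org/corpus"},
    },
    {
        "@id": "https://example.org/corpus",
        "@type": "Dataset",
        "name": "Standard corpus (described locally, stored elsewhere)",
    },
]
```

Here contextual_entities(example_graph) picks out only the corpus record, while derived.csv remains a local part of the crate.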

ptsefton avatar Aug 11 '19 22:08 ptsefton