ro-crate Data Entity identification and handling in tools

This issue is about the behavior of tools such as rocrate-validator and ro-crate-py when dealing with data entities, and whether the spec is clear enough in this respect. Their current approach is described in https://github.com/crs4/rocrate-validator/issues/62#issuecomment-2624181022. I opened https://github.com/crs4/rocrate-validator/issues/62 after discussing the handling of files in https://github.com/nextflow-io/nf-prov/pull/39, which adds Workflow Run RO-Crate support to Nextflow; after discussing it with @kikkomep, I no longer think it was a bug in the validator.

In short, given this statement in the spec (it's the same in 1.1 and 1.2-DRAFT):

Where files and folders are represented as Data Entities in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the hasPart property.

should a tool consider File and Dataset entities as data entities and check that they are linked from the Root Data Entity (RDE)'s hasPart (this is what the validator does) or should it assume that everything that's linked from the RDE's hasPart is a data entity (this is what ro-crate-py does)? Or is it right for different tools with different roles to do different things?

I currently think that the validator is doing the right thing, since it's supposed to check for things like forgetting to link data entities from the RDE's hasPart. Regarding ro-crate-py, I'm starting to have doubts: in particular, following the "indirectly" bit above, it implements a recursive walk of hasPart properties, and I've recently noticed that this leads to a weird situation where SoftwareApplication entities are read as data entities because the ComputationalWorkflow is also a File and it links to the workflow's tools via hasPart as prescribed by the Workflow Run Crate profile (which follows Bioschemas in this respect). Maybe it should only follow hasPart from Dataset and not File?
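To make the problem concrete, here is a minimal sketch of the two traversal strategies over a plain list of JSON-LD entities. It does not use the ro-crate-py API; the function names, the `dataset_only` flag and the toy graph (including the `#fastp` id) are assumptions made up for illustration.

```python
def as_list(value):
    """Normalize a JSON-LD property value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def types_of(entity):
    """Return the set of @type values of an entity."""
    return set(as_list(entity.get("@type")))


def walk_has_part(graph, root_id="./", dataset_only=False):
    """Return the @ids reachable from root_id through hasPart links.

    With dataset_only=True, hasPart is followed only out of Dataset entities.
    """
    by_id = {e["@id"]: e for e in graph}
    seen, stack = set(), [root_id]
    while stack:
        entity = by_id.get(stack.pop())
        if entity is None:
            continue
        if dataset_only and "Dataset" not in types_of(entity):
            continue
        for ref in as_list(entity.get("hasPart")):
            if ref["@id"] not in seen:
                seen.add(ref["@id"])
                stack.append(ref["@id"])
    return seen


# Hypothetical crate: the workflow is also a File and lists its tool in hasPart.
graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "workflow.cwl"}]},
    {"@id": "workflow.cwl",
     "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
     "hasPart": [{"@id": "#fastp"}]},
    {"@id": "#fastp", "@type": "SoftwareApplication"},
]

print(sorted(walk_has_part(graph)))                     # ['#fastp', 'workflow.cwl']
print(sorted(walk_has_part(graph, dataset_only=True)))  # ['workflow.cwl']
```

In the first call the SoftwareApplication ends up in the result only because the walk continues out of the workflow File; restricting the walk to Dataset entities leaves it out.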

Should the spec statement cited above be made clearer in order to help implementations? E.g.:

File and Dataset entities in the RO-Crate JSON-LD MUST be linked to from the Root Data Entity using the hasPart property. This link could be indirect, meaning that the Root Data Entity links to a Dataset whose hasPart, in turn, links to other File and Dataset entities.

Or is it better to leave it as it is, and allow different tools to do what's more appropriate for their purpose?

Feb 06 '25 12:02 simleo

Notes

My understanding of the current situation from a recent discussion with Stian is this:

all data entities must have @type of either File or Dataset
but not all entities with type File or Dataset are necessarily data entities
data entities must be reachable from the root data entity via hasPart relations (but this isn't really what defines them, only how you expect to find them)

Looking through the spec again now, I don't think we have a machine-actionable definition of what a data entity is. We have a section Contextual vs Data entities that says (same in both 1.1 and 1.2-DRAFT):

Data entities primarily exist in their own right as a file or directory (which may be in the RO-Crate Root directory or downloadable by URL).

And 1.2-DRAFT includes a counter-example for the assumption that all Files are data entities:

Files in the RO-Crate Root are not necessarily data entities – the RO-Crate Metadata Descriptor is a file in the RO-Crate Root, but is considered a Contextual Entity as it is describing the RO-Crate, rather than being part of it. On the other hand, the Root Data Entity is a data entity within its own metadata file.

In both versions we also have this, which shows that checking a web-based entity for downloadability also doesn't indicate whether it should be in hasPart. That makes things more difficult:

Some contextual entities can also be considered data entities – for instance the license property refers to a CreativeWork that can reasonably be downloaded, however a license document is not usually considered as part of research outputs and would therefore typically not be included in hasPart on the root data entity.

Thoughts

I think it would be nice to declare a fully machine-actionable definition of data entities. For example, I know @ptsefton had some ideas about using conformsTo on individual datasets (see this line in draft PR #390 https://github.com/ResearchObject/ro-crate/pull/388/files#diff-93edefce62dc56f998054f6c2d9eb87bc0d317af57df2b4bf243ba0f6f0c5400R138). I don't know if that's the specific approach we want to take, but it seems like we may need to add something if we want to avoid needing that element of human judgment about whether something is or isn't a data entity. (Though that said, I guess the human judgement part does get filled into hasPart. Just a validator can't check if the judgement is right!)

More tangibly, there is at least an option to partially improve this for entities which use a local/relative URI for their @id, as (I think) if those have @type of File or Dataset, they are definitely data entities.

Feb 06 '25 14:02 elichad

Thanks @elichad this is indeed still confusing. I will attempt to clear it up.

Entities of type File are always data entities. In an Attached context, these MUST be present in the RO-Crate Root, and for web entities (Attached or Detached context) it would be up to the client to validate that they are there.

That statement "Files in the RO-Crate Root are not necessarily data entities" needs to be reworded - it invites confusion of files in the real world with entities in the @graph. We should say that the RO-Crate Metadata Descriptor is not considered a Data Entity and the RO-Crate Metadata File MUST (or SHOULD?) not reference itself as a File.

Files present in the RO-Crate Root of an Attached RO-Crate Package do not have to be represented as data entities. (ASIDE: I think this is covered elsewhere). The RO-Crate Metadata Descriptor is a Contextual Entity which describes the RO-Crate as a whole and identifies the entry point for the RO-Crate. The use of the @id of ro-crate-metadata.json is a convention, and does not imply that the descriptor is a Data Entity.

The Root Data Entity in any RO-Crate is a Data Entity.

The statement about licenses can be reworded. I would take out the implication that a web-based license is a Data Entity when it does not have File as one of its @type values.

Some contextual entities may reference data in a similar way to Data Entities. For instance, the license property refers to a CreativeWork that can reasonably be downloaded; however, a license document is not usually considered as part of research outputs and would not be included in hasPart on the root data entity.

{ example 1 .with a CC license ..}

If, however a copy of the license is intended to be included in an Attached RO-Crate Package then it MUST:

have an additional type of File.

Have an @id which is a relative URI which references a copy of the license that is present in RO-Crate Root.

Indicate that the File is part of the package via hasPart (An aside here -- this requirement to have hasPart seems to me like something we could drop in RO-Crate 2 (and maybe even 1.2) -- if data entities are well defined then why force people to have this extra step that can be quite error prone? We could just say if it's a File it's part of the RO-Crate).

{ example 2 .with a CC license ..}

In a Detached RO-Crate Package, a license MAY be included in the packaged files by adding the type File - and optionally supplying a localPath property to indicate where the license may be stored in an RO-Crate Root if the package is downloaded.

{ example fragment - adding a localPath to example 1 above }

I think this would clear things up for File. Dataset is a bit more problematic, but I think a Dataset is considered to be a Data Entity in the following scenarios:

When it is the Root Data Entity (Attached or Detached)
In an Attached RO-Crate Package when it has a relative URI.

In all other cases, Dataset should be considered as a Contextual Entity. (NOTE: In cases where a Dataset has an absolute URI @id then resolving that to a list of Files or Datasets is complicated and out of scope for RO-Crate 1.2, though implementors may choose to build software that uses this approach).

How does this look @elichad and @stain? File -- ALWAYS a Data Entity. Dataset only a data entity in an attached context where we can reliably get a directory listing. (And I know it's a late entry, but how about dropping the MUST on hasPart as it really just makes for a lot of extra checking and dealing with things that are in hasPart but not present as Data Entities etc. This is an area where we could simplify software. Not to say you can't use it to show pathways to data from the root, but libraries and HTML previews etc. can provide a list of files and directories easily enough programmatically.)

I don't think we need conformsTo here @elichad.

Feb 06 '25 21:02 ptsefton

That statement "Files in the RO-Crate Root are not necessarily data entities" needs to be reworded - it invites confusion of files in the real world with entities in the @graph.

Indeed that is the mistake that I made 😅 thanks for pointing it out.

We should say that the RO-Crate Metadata Descriptor is not considered a Data Entity and the RO-Crate Metadata File MUST (or SHOULD?) not reference itself as a File.

Linking this to the recently opened #394 which was asking about this.

File -- ALWAYS a Data Entity. Dataset only a data entity in an attached context where we can reliably get a directory listing. (And I know it's a late entry, but how about dropping the MUST on hasPart as it really just makes for a lot of extra checking and dealing with things that are in hasPart but not present as Data Entities etc.

Interesting idea - are you suggesting dropping the hasPart requirement to a SHOULD or a MAY, or removing it completely? It does have its benefits for indicating nested structures in a crate.

Also, part of the challenge with the Nextflow PR linked at the top of thread is: when you are describing a workflow execution that generates some intermediate files, you might want to represent them in the metadata (as inputs/outputs for steps) but not include them in the crate (e.g. because they are large). So they're not data entities, but File is an intuitive choice for their type. We can advise alternative types to use in this case (e.g. CreativeWork or DigitalDocument), but it feels a bit strange.

Feb 07 '25 14:02 elichad

I am suggesting that hasPart could be optional. It is useful for showing hierarchy if you want it, but for basic packaging it's actually not necessary if we sort out our expectations about whether data needs to be present - it can be inferred that if there are File and Directory entities with URL or path IDs then they're part of the package. And as @simleo notes, following hasPart recursively has lots of issues and it's complicated for both producers and consumers.

Regarding files that don't (yet) exist, I agree that it makes sense that these are @type File. I think there are two solutions worth considering:

We add a property to indicate that the file does not or might not yet exist, something like dontValidate; I am not sure if there's an obvious one from a standard schema
For files that don't exist, give them a local id like #file/that/does/not/yet/exist.txt with a localPath to indicate the path -- this pattern would indicate that the File is not in the package but may come into existence at localPath

Feb 07 '25 21:02 ptsefton

Also, part of the challenge with the Nextflow PR linked at the top of thread is: when you are describing a workflow execution that generates some intermediate files, you might want to represent them in the metadata (as inputs/outputs for steps) but not include them in the crate (e.g. because they are large). So they're not data entities, but File is an intuitive choice for their type. We can advise alternative types to use in this case (e.g. CreativeWork or DigitalDocument), but it feels a bit strange.

Intermediate files are added as CreativeWork in the Nextflow plugin. Their IDs look like:

#task/c6cb99a1e70c4b8f2eb83700dc0145d9/test_1.fastp.fastq.gz

BTW, they have released version 1.4.0 of the plugin (https://github.com/nextflow-io/nf-prov/releases/tag/1.4.0) with support for Workflow Run RO-Crate.

Feb 10 '25 09:02 simleo

So if there is already precedent then we could stick with File reserved for things that are part of the package and MUST be there rather than adding something for File entities that may not exist? What does everyone think about that? Makes it simpler.

We can look at further nuance in V2.

Feb 10 '25 20:02 ptsefton

Just chiming in here as we from DataPLANT encountered a similar issue with the validator. We are using the MediaObject type (i.e. File in RO-Crate) to describe fragments of files through fragment selectors. They are linked from "normal" file data entities through hasPart. This causes the validator to interpret them as data entities, but there is no corresponding file on the file system. Hence, the validation fails.

We are unsure whether our approach is consistent with the RO-Crate specification. Until now, we thought it is. Could also be that the validator behaves "over-eagerly" here, as @elichad suggested. What's your opinion on this?

Is there anything to add, @HLWeil @kMutagene @muehlhaus?

Mar 28 '25 12:03 floWetzels

ro-crate-py has recently been changed to recursively follow hasPart only from Dataset entities, see https://github.com/ResearchObject/ro-crate-py/pull/216. This was motivated by the same use case described by @floWetzels, i.e. to avoid considering file fragments as data entities.

Mar 28 '25 14:03 simleo

I just had a conversation with @stain about how to handle this issue for 1.2.

Right now, the spec does not explicitly say that all File entities are data entities, which has led to different people/tools having different interpretations of whether they are or not.

There are a few known situations in which this is especially relevant, e.g.

intermediate files from a workflow which aren't included in the final crate (such as in the Nextflow case referenced earlier)
referencing a file which exists in another RO-Crate
@floWetzels' file-fragment case above

In these situations people tend towards using File (or MediaObject) as the most intuitive option, but alternatives can be found.

Now, we need to clarify this one way or the other. At this late stage in the preparation of 1.2 I think it's better to make explicit the looser interpretation - which is that File entities do not have to be data entities. We have so far managed to avoid making changes in 1.2 that break backward compatibility with 1.1, and I want to maintain that (which wouldn't be possible with the stricter rule of "all Files are data entities").

Therefore, for 1.2 specifically my suggestion is to add a sentence like:

Tip: While all data entities in an RO-Crate must be either File or Dataset, not all File or Dataset entities in an RO-Crate are necessarily data entities.

I'm going to raise a PR for that and ask @ptsefton for approval (as someone who has been using the stricter interpretation).

Beyond that, I think it is worth revisiting this discussion when we start to think about v2 - I'm not opposed to making more drastic changes in future!

Mar 31 '25 16:03 elichad

I don't think this tip is correct. I think the recent updates all say that File is always a data entity which either has a relative path ID in an Attached RO-Crate Package or a fully qualified URI. For cases like files that are yet to appear, use another type. Datasets are considered Data entities when they have a path-like URI pointing to a directory or a fully qualified URI to something like another crate, but MAY have abstract IDs with a #.

Apr 01 '25 22:04 ptsefton

Another perspective on our specific issue could be that file fragments are indeed valid data entities: they have a unique path and they exist on the file system. It's just a matter of correctly validating this, which at the moment does not happen. However, this would require some formalized standard on fragment selectors (or at least a robust separation symbol in the path), and I don't know whether one exists. (We use # at the moment, like in URLs or markdown, which is probably not the best solution, so we are open to any input.) What do you think of this perspective?

Of course, this does not solve the backwards compatibility issue for intermediate or external files. The suggestions by @simleo and @elichad also seem like good solutions.

Apr 02 '25 12:04 floWetzels

Round 2 - as the previous suggestion wasn't satisfactory, I had a chat with @ptsefton and @stain this morning and we came to agreement on the changes that appear in the new PR #426.

These changes can be summarised as: File and Dataset entities are considered data entities if they use an absolute or relative URI, and not data entities if they use a local identifier starting with #.

RO-Crate consuming tools which want to find data entities should search for all Files and Datasets which have an absolute or relative URI as @id. These should all also appear in hasPart on the Root Data Entity, but hasPart is not the defining factor. (This makes it much clearer for implementation and validation purposes!)
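As a rough sketch of how a consuming tool might apply this rule, the snippet below classifies entities in a plain JSON-LD graph by @type and @id only. The helper names and the toy graph are assumptions for illustration, and "absolute" is approximated here as "has a URI scheme according to `urllib.parse.urlsplit`".

```python
from urllib.parse import urlsplit

DATA_TYPES = {"File", "Dataset"}


def as_list(value):
    """Normalize a JSON-LD property value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def id_kind(entity_id):
    """Classify an @id as 'local' (#...), 'absolute' (has a scheme) or 'relative'."""
    if entity_id.startswith("#"):
        return "local"
    return "absolute" if urlsplit(entity_id).scheme else "relative"


def is_data_entity(entity):
    """File/Dataset whose @id is an absolute or relative URI; hasPart is not consulted."""
    has_data_type = bool(DATA_TYPES & set(as_list(entity.get("@type"))))
    return has_data_type and id_kind(entity["@id"]) != "local"


# Hypothetical graph; the '#task/...' entry mimics an intermediate workflow file
# typed as File but given a local identifier.
graph = [
    {"@id": "./", "@type": "Dataset"},
    {"@id": "data/reads.fastq.gz", "@type": "File"},
    {"@id": "https://example.com/remote.csv", "@type": "File"},
    {"@id": "#task/c6cb99a1e70c4b8f2eb83700dc0145d9/test_1.fastp.fastq.gz", "@type": "File"},
    {"@id": "https://creativecommons.org/licenses/by/4.0/", "@type": "CreativeWork"},
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork"},
]

data_ids = [e["@id"] for e in graph if is_data_entity(e)]
print(data_ids)
# ['./', 'data/reads.fastq.gz', 'https://example.com/remote.csv']
```

Because the decision depends only on each entity's own @type and @id, no graph traversal is needed to find data entities.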

Here are suggestions for how this could work with the cases listed above:

intermediate files from a workflow which aren't included in the final crate (such as in the Nextflow case referenced earlier)
- use a local identifier starting with #. They are not data entities
referencing a file which exists in another RO-Crate
- depends a bit on context, you can choose whether to make it a data entity or not as appropriate
file-fragments
- @id is likely to be a URI, so probably should use a different type than File to get around the problem of not being able to find it as part of the payload (perhaps one of the more specific subtypes of MediaObject, like TextObject?)

In our discussion we thought that hasPart should also be able to include entities that are not data entities, so in the cases above, you could have those entities still be in hasPart if you wanted.

@simleo @floWetzels and anyone else watching this thread, please let us know your feedback on this as soon as possible (preferably by Tuesday). If nobody complains, we'll go ahead in this direction.

Apr 04 '25 15:04 elichad

RO-Crate consuming tools which want to find data entities should search for all Files and Datasets which have an absolute or relative URI as @id

The validator now considers all File and Dataset entities as data entities, so implementing this change will require leaving out those whose @id starts with a # (@kikkomep please check this and correct me if I'm wrong).
ro-crate-py now considers all entities that appear in the root data entity's hasPart as data entities. This will have to change, since it will have to look for all Files and Datasets except those that start with a #.

These should all also appear in hasPart on the Root Data Entity

With the current 1.2 specs that's actually a MUST:

Where files and folders are represented as Data Entities in the RO-Crate JSON-LD, these MUST be linked to, either directly or indirectly, from the Root Data Entity using the [hasPart] property.

So if the above does not change the validator will have to continue reporting a REQUIRED violation if a data entity is not linked to from the root data entity's hasPart.
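A validator-style check for the MUST quoted above might look like the sketch below. It is not the rocrate-validator implementation: the function names and example graph are invented, entities whose @id starts with # are skipped as in the proposal under discussion, and the walk follows every hasPart link, which is a simplification.

```python
def as_list(value):
    """Normalize a JSON-LD property value to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def reachable_via_has_part(graph, root_id="./"):
    """Collect every @id linked directly or indirectly from the root via hasPart."""
    by_id = {e["@id"]: e for e in graph}
    seen, stack = set(), [root_id]
    while stack:
        entity = by_id.get(stack.pop(), {})
        for ref in as_list(entity.get("hasPart")):
            if ref["@id"] not in seen:
                seen.add(ref["@id"])
                stack.append(ref["@id"])
    return seen


def unlinked_data_entities(graph, root_id="./"):
    """Return @ids of File/Dataset entities (non-# ids) missing from the hasPart tree."""
    linked = reachable_via_has_part(graph, root_id)
    missing = []
    for entity in graph:
        entity_id = entity["@id"]
        if entity_id == root_id or entity_id.startswith("#"):
            continue
        is_file_or_dataset = {"File", "Dataset"} & set(as_list(entity.get("@type")))
        if is_file_or_dataset and entity_id not in linked:
            missing.append(entity_id)
    return missing


# Hypothetical crate with one File that was never linked from hasPart.
graph = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "data/"}]},
    {"@id": "data/", "@type": "Dataset", "hasPart": [{"@id": "data/a.txt"}]},
    {"@id": "data/a.txt", "@type": "File"},
    {"@id": "orphan.txt", "@type": "File"},
    {"@id": "#tmp", "@type": "File"},
]

print(unlinked_data_entities(graph))  # ['orphan.txt']
```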

My impression is that these changes should be feasible.

Apr 07 '25 12:04 simleo

Thanks for the effort and pushing this towards a conclusion, @elichad!

Here is some feedback from us:

We are not sure if another type is a good solution for file fragments. MediaObject actually fits perfectly (though the alias File does not), as the files can have any encoding. The subtypes are too specific, TextObject for example does not capture binary encoded files. I think it is unrealistic to always detect or even have a fitting type. MediaObject has the right level of generality.
Coming back to my comment above about fragments being data entities, we also know (from discussions with people from NFDI4BIOIMAGE) about some use cases in bioimaging where images are distributed over multiple files. In particular, X images could be distributed over Y files. In that regard, it makes sense to consider this option for the future, what do you think? This perspective would be invalid when using the approach with a different type, right?
Nonetheless, we agree that we should work towards a clear definition of data entities. The workaround using ad-hoc ids starting with # would work for our use case (definitely preferable over a separate type). We just think it's not the optimal solution.

Let me know what you think of our feedback.

Apr 07 '25 13:04 floWetzels

Regarding fragments, see #200, and the example at https://github.com/ResearchObject/workflow-run-crate/blob/6bfba3f407d028557ed7b3b8603730d0a36a5ba8/docs/examples/draft/ml-predict-pipeline-cwltool-runcrate/ro-crate-metadata.json

Apr 10 '25 10:04 simleo

Solution in #426 is good, just note @floWetzels that it leaves undefined whether "@id": "folder/foo.zip#bah.txt" is a data entity or not, as it would be a relative path and not start with #, yet it does contain a fragment identifier. I think it would make sense for these to be described the other way, with isPartOf back to the container folder/foo.zip entity, unless you are listing all its content from its hasPart.

As ZIP does not have a fragment identifier defined, that approach is still a bit of appropriation, but it does make sense for CWL for instance, which does. (See also https://github.com/common-workflow-language/cwlviewer/wiki/Permalinks )
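To illustrate the ambiguity, a tool could separate the three shapes of @id with `urllib.parse.urldefrag`, as in this small sketch. The category names and the `container_link` helper are made up here, and the isPartOf link it produces is just one possible reading of the suggestion above, not something #426 defines.

```python
from urllib.parse import urldefrag


def id_category(entity_id):
    """Return 'local' for '#...', 'fragment' for 'path#frag', otherwise 'plain'."""
    if entity_id.startswith("#"):
        return "local"
    _container, fragment = urldefrag(entity_id)
    return "fragment" if fragment else "plain"


def container_link(entity_id):
    """Build a stub entity pointing back to its container via isPartOf, if any."""
    container, fragment = urldefrag(entity_id)
    if not container or not fragment:
        return None
    return {"@id": entity_id, "isPartOf": {"@id": container}}


for entity_id in ["folder/foo.zip#bah.txt", "#bah.txt", "folder/foo.zip"]:
    print(entity_id, "->", id_category(entity_id))
# folder/foo.zip#bah.txt -> fragment
# #bah.txt -> local
# folder/foo.zip -> plain

print(container_link("folder/foo.zip#bah.txt"))
# {'@id': 'folder/foo.zip#bah.txt', 'isPartOf': {'@id': 'folder/foo.zip'}}
```

A check that only tests whether the @id starts with # would put the first case in the same bucket as an ordinary relative path.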

It would also be better to use arcp URIs for these @ids (e.g. referring to file folder2/soup.txt within http://example.com/crate1.zip) as that would give you a root path within the container from which relative paths still work (and also make them absolute URIs and valid data entities within #426 definition) -- this should work equally well for non-zip-like formats as long as you can decide on what is the path after the /
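For illustration, a location-based arcp URI could be built roughly as follows. This sketch assumes the `arcp://uuid,<uuid>/<path>` form with a version 5 UUID derived from the archive URL in the URL namespace, which is my reading of the arcp draft; please check it against the arcp specification or the `arcp` Python library before relying on it. The function name is hypothetical.

```python
import uuid


def arcp_for_location(archive_url, path):
    """Build an arcp URI for `path` inside the archive found at `archive_url`.

    Assumption: the authority is 'uuid,' followed by a UUID v5 computed from
    the archive URL in the URL namespace.
    """
    base_id = uuid.uuid5(uuid.NAMESPACE_URL, archive_url)
    return f"arcp://uuid,{base_id}/{path.lstrip('/')}"


entity_id = arcp_for_location("http://example.com/crate1.zip", "folder2/soup.txt")
print(entity_id)
```

The same archive URL always yields the same UUID, so independent tools would arrive at the same @id, and the result has a scheme, so it would count as an absolute URI under the #426 definition.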

BTW for elements of a document, like a paragraph, Schema.org also has https://schema.org/WebPageElement and children, e.g. https://schema.org/Table -- these are probably not suitable for inside of data files like NetCDF.

Apr 11 '25 13:04 stain