Refer to Dataset as `Folder` or `Directory`

Open multimeric opened this issue 7 months ago • 12 comments

This recently came up when discussing RO-Crate with someone not familiar with the spec. We alias MediaObject to File since it makes it more intuitive. Why do we not do the same for Dataset? I have checked, and neither Folder nor Directory are currently in the RO-Crate 1.2 context, so it seems fair game.

I suppose I would propose for this to be in RO-Crate 2.X since it's a fairly substantial change.
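For readers unfamiliar with the alias, here is a minimal sketch of how the existing `File` term resolves to `MediaObject`, and what a `Folder` alias could look like. The `PROPOSED_CONTEXT` dictionary and the `expand_type` helper are hypothetical illustrations; `Folder` is not part of any released RO-Crate context.

```python
# Excerpt of term mappings as found in the RO-Crate JSON-LD context.
CONTEXT = {
    "File": "http://schema.org/MediaObject",  # existing alias
    "Dataset": "http://schema.org/Dataset",
}

# Hypothetical: the alias proposed in this issue (not in any released context).
PROPOSED_CONTEXT = {**CONTEXT, "Folder": "http://schema.org/Dataset"}


def expand_type(term, context):
    """Return the IRI a type term maps to, or the term itself if unmapped."""
    return context.get(term, term)


print(expand_type("File", CONTEXT))
print(expand_type("Folder", PROPOSED_CONTEXT))
```

Under this sketch `Folder` and `Dataset` would expand to the same IRI, so the alias would change how metadata reads without changing what it means as linked data.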

multimeric avatar May 20 '25 04:05 multimeric

Added to 2.0 milestone for consideration. Would not be a simple change. My initial thought is that it is probably better to get proper File and Folder classes in RO-Terms or Schema.org and leave Dataset as a more abstract entity.

ptsefton avatar May 21 '25 00:05 ptsefton

Yes! If that option is on the table, I absolutely think that physical files and directories should have uniquely defined RO-Crate types. Then validation becomes easier, finding the root data entity and other data entities becomes easier, we don't have to mandate the use of a term alias anymore, and we can name them whatever we want.

multimeric avatar May 21 '25 01:05 multimeric

Everything is on the table for 2.0 - as long as it's in line with the general approach and philosophy of RO-Crate to be linked data compatible but approachable without requiring linked-data (RDF) tools.

It's really surprising how little love the concept of a "File" gets - I have been unable to find a File Class in any respectable ontology and even if you go read about POSIX it's not clearly defined, or look at what the preservation/archive people do in OAIS and they avoid talking about files for some reason.

We may have to just define it in RO-Terms and be done with it. When we started with DataCrate and then RO-Crate I was trying to avoid defining anything that would need looking after. RO-Crate is now established enough for us to fill in gaps where needed with our own stuff.

ptsefton avatar May 21 '25 01:05 ptsefton

Here's a nice File class from the REPRODUCE-ME ontology:

  • BioPortal Link
  • IRI: https://w3id.org/reproduceme#File
  • Definition: "A storage to store data. For example, a computer file."

I think the description is a bit weak, but the ontology as a whole has lots of useful stuff we might want in the spec.

Here's another from Semanticscience Integrated Ontology:

  • BioPortal Link
  • IRI: http://semanticscience.org/resource/SIO_000396
  • Definition: "A file is an information-bearing object that contains a physical embodiment of some information using a particular character encoding."

multimeric avatar May 29 '25 06:05 multimeric

+1 to having the discussion - but feeling unease with adding Directory or Folder more explicitly in the spec.

Simple reason: we do not need it. And yes, less is less, but it can allow for more. See also https://www.goodreads.com/quotes/285224-shape-clay-into-a-vessel-it-is-the-space-within

benefit comes from what is there;
Usefulness from what is not there -- from the Tao Te Ching

Furthermore, directories and folders have always been a mistake. Classic system modelling theory warns against focusing only on the problem space, and against modelling the "old solution" into the new system that is designed to replace it. This is precisely what happened with these: they are echoes of cabinets and binders, which tied access and related aspects to the unique physical presence of files and records. As background material to this claim I highly recommend the classic Ted Nelson rant on "The Nightmare of Files and Directories" at http://www.thetednelson.com/computers_for_cynics.php

To conclude, I would argue the RO-Crate spec only needs 'things' and 'loose collections of things'. The former are our data entities, basically anything with an @id in the @graph. The latter do not need a constraining type either: the simple presence of a hasPart relation allows inferring that role.

In general, on a semantic level we should focus only on unambiguous ways to place and find information reliably. We can do that by sticking to well-designed and documented property paths, or de-facto shapes (duck-typing for data access, if you like). This allows us to avoid putting heavy constraints and effects on actual @type declarations. It allows more freedom, a broader use of the spec, more liberty for users to use the @type for their specific purposes, and lets them put other kinds of eggs in our baskets?
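The duck-typing idea can be sketched as follows: a collection is whatever has a hasPart, regardless of its @type. This is only an illustration under stated assumptions; the helper names `as_list`, `collections` and `walk`, the example graph, and the `Thing` type on `data/` are hypothetical and not taken from the spec.

```python
def as_list(value):
    """JSON-LD allows a single object or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def collections(graph):
    """Return the @id of every entity that has at least one hasPart."""
    return [e["@id"] for e in graph if as_list(e.get("hasPart"))]


def walk(graph, root_id="./"):
    """Follow hasPart from the root and return every reachable @id."""
    by_id = {e["@id"]: e for e in graph}
    seen, stack, order = set(), [root_id], []
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        for part in as_list(by_id.get(current, {}).get("hasPart")):
            stack.append(part["@id"])
    return order


# Hypothetical graph: "data/" is deliberately not typed as Dataset.
GRAPH = [
    {"@id": "./", "@type": "Dataset", "hasPart": [{"@id": "data/"}]},
    {"@id": "data/", "@type": "Thing",
     "hasPart": [{"@id": "data/a.csv"}, {"@id": "data/b.csv"}]},
    {"@id": "data/a.csv", "@type": "File"},
    {"@id": "data/b.csv", "@type": "File"},
]

print(collections(GRAPH))
print(walk(GRAPH))
```

Neither function reads @type, which is the point of the argument: the hierarchy is recoverable from hasPart alone.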

As such, I would even suggest reducing the usage of Dataset in RO-Crate to only those entities that are expected to have that type (implicit or inferred) by their presence in some catalogue.

I understand this is sliding towards a more philosophical standing and observation. However, what drew me to ro-crates in the first place, its appeal, was (still is) this sensible 'elegance' and 'simplicity first' approach. We could argue that the 1.x iterations have been growing some mist around that core? Maybe we might want to make some formal statements or guiding principles about putting that front and central again when working on 2.0.

In Locrian Law (classic Greece) this policy existed (see https://www.purplemotes.net/2008/10/19/some-peculiar-legal-institutions/ )

if any one wishes to enact a new statute, he proposes it with his neck in a noose, and if the statute is judged to be good and useful, the proposer goes away alive, but, if not, the noose is drawn and he dies. … [in more than two hundred years] they had only one new statute passed.

I agree this is brutally extreme, barbaric, outdated, unfitting modern innovative needs. Still the imagery might be effective when we ponder adding new stuff against the core question "Do we need it?"

mpo-vliz avatar Oct 15 '25 09:10 mpo-vliz

I think you are arguing something beyond the scope of this issue. RO-Crate already explicitly models file hierarchies, and so I propose that it does so even more explicitly.

multimeric avatar Oct 15 '25 10:10 multimeric

I got carried away indeed :-)

repeating my basic point

What is not explicit enough about hasPart? To me: any entity with those around is part of the hierarchy we model.

So any added extra @type on those lower levels: ro-crate spec does not care? Why confuse people with an alternative way to model hierarchy?

how that angle does fit this suggestion in imho

The remaining @type we are concerned with is the one on the root entity. And that has a totally different role imho, one that is not about (internal) hierarchical structuring of content, but about identifying the 'crate itself' as an object to be found in a schema:DataCatalog. To me, logically, a schema:Dataset.

I see no value in hiding that through an alias.

and going further

I do not think aliasing in general is helping to battle the confusion. I find it a false way to appease users, or even keep them ignorant, instead of guiding them to what is going on.

So yeah, I find it needlessly surprising that our File ends up being a schema:MediaObject. So I would rather not see us add more of that, but try to get rid of it. That is also how I read the ambition in https://github.com/ResearchObject/ro-crate/issues/385

mpo-vliz avatar Oct 20 '25 12:10 mpo-vliz

Interestingly, it seems that the Croissant spec (https://docs.mlcommons.org/croissant/docs/croissant-spec.html#resources) defines FileObject and FileSet, which explicitly extend schema.org. This could be an easier avenue than defining our own File or Directory.

multimeric avatar Oct 21 '25 00:10 multimeric

That looks promising.

The original aliasing was a bad implementation choice I made in DataCrate.

ptsefton avatar Oct 21 '25 00:10 ptsefton

@multimeric The Croissant stuff looks promising but ATM it's in their namespace "cr": "http://mlcommons.org/croissant/" not schema.org. And http://mlcommons.org/croissant/FileObject gives you a 404.

If there's a plan for these to be added to Schema.org though they'd be good.
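To make the namespace point concrete, here is a small sketch of how a prefixed term such as `cr:FileObject` expands to the IRI mentioned above. The `expand_curie` helper is hypothetical; the `cr` prefix mapping is the one quoted in this comment.

```python
PREFIXES = {
    "cr": "http://mlcommons.org/croissant/",  # Croissant namespace quoted above
    "schema": "http://schema.org/",
}


def expand_curie(curie, prefixes):
    """Expand 'prefix:local' to a full IRI; return the input if no prefix matches."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return curie


print(expand_curie("cr:FileObject", PREFIXES))
print(expand_curie("schema:MediaObject", PREFIXES))
```

The expansion is purely textual, so it works whether or not the resulting IRI resolves.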

ptsefton avatar Oct 24 '25 04:10 ptsefton

Does that matter? Many of the terms that RO-Crate uses in its context are not schema.org or don't resolve. The nice part is that Croissant fits into the same hierarchy.

Is it the rule that every entity must have schema.org types? I have loosely interpreted this to mean that external child types like bioschemas are allowed but maybe that's not the case.

multimeric avatar Oct 24 '25 05:10 multimeric

The Croissant term identifiers should work as they are theirs to mint, but we can work around their non-resolution in a profile crate (even the base profile), as we already had to do for codemeta (e.g. see https://www.researchobject.org/ro-crate/specification/1.2/ro-crate-preview.html#https%3A//codemeta.github.io/terms/; meanwhile @dgarijo has been pushing them to make PIDs).

It is worth pointing out to Croissant that these are not good identifiers until they point/redirect to something! Perhaps @ljgarcia, who is a regular in their meetings, could help out?

stain avatar Oct 24 '25 12:10 stain