
Dataset: how to represent a collection of distinct related files

Open cmungall opened this issue 3 years ago • 3 comments

Let's say I have a directory full of files, perhaps the results of different genome annotation analyses, all on the same sample.

With Frictionless, this might be represented as one DataPackage with multiple DataResources.

DCAT has a series of examples of loosely structured datasets, e.g. example 57, which is analogous:

https://www.w3.org/TR/vocab-dcat-3/#ex-elaborated-bag

Here there is one "container" Dataset and multiple individual Datasets, each with its own serialization.

Is Bioschemas intended to be isomorphic to DCAT3? Should we use the same structure and link to the same documentation?

hasPart is in the profile but it has a very generic description:

Schema: Indicates an item or CreativeWork that is part of this item, or CreativeWork (in some sense). Inverse property: isPartOf

Or perhaps the container should be a catalog?

cmungall avatar May 16 '22 18:05 cmungall

The Bioschemas Dataset profile is defined over the existing schema.org Dataset type which itself is drawn from DCAT (version 2).

There are of course multiple ways you could model this, and the choice is up to the deployer of the markup: either as one Dataset with multiple parts that are themselves Datasets, or as a collection of Datasets. In both cases, there would also be a DataCatalog, which is the web site that makes the Datasets available.

:dc a DataCatalog ;
    dataset :x1, :x2, ...

or

:dc a DataCatalog ;
    dataset :x .
:x hasPart :x1, :x2, ...

I think that both of these are compatible with the proposed profile and it comes down to the markup developer's personal choice.
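To make the two options concrete, here is a minimal sketch of how each might look as schema.org JSON-LD, built with Python's standard `json` module. The `example.org` identifiers, the names, and the `dataset` helper are assumptions made up for illustration; only the `DataCatalog`, `Dataset`, `dataset` and `hasPart` terms come from schema.org.

```python
import json

SCHEMA_CONTEXT = "https://schema.org/"


def dataset(identifier, name):
    """Return a minimal schema.org Dataset node (hypothetical helper)."""
    return {"@type": "Dataset", "@id": identifier, "name": name}


# Hypothetical part datasets standing in for :x1 and :x2.
parts = [
    dataset("https://example.org/x1", "Annotation analysis 1"),
    dataset("https://example.org/x2", "Annotation analysis 2"),
]

# Option 1: the catalog lists each Dataset directly.
flat_catalog = {
    "@context": SCHEMA_CONTEXT,
    "@type": "DataCatalog",
    "@id": "https://example.org/dc",
    "dataset": parts,
}

# Option 2: the catalog lists one container Dataset, which has the parts.
container = dataset("https://example.org/x", "All analyses for one sample")
container["hasPart"] = parts
nested_catalog = {
    "@context": SCHEMA_CONTEXT,
    "@type": "DataCatalog",
    "@id": "https://example.org/dc",
    "dataset": [container],
}

print(json.dumps(nested_catalog, indent=2))
```

The only structural difference is whether the individual Datasets hang off the catalog directly or off a container Dataset via `hasPart`.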

AlasdairGray avatar May 18 '22 16:05 AlasdairGray

It seems that modeling this as one Dataset with multiple distributions would be discouraged, though? Even if the cardinality of distribution is >1 (#575), it seems the intent is for distribution to model an alternate serialization of the same data, rather than different parts of the dataset?

cmungall avatar May 24 '22 14:05 cmungall

In my examples, I didn't get to the distributions. These would be added to each Dataset using the distribution property, which should allow many values since the same data could be offered in different RDF serialisations, as CSV, or in a multitude of other formats.

To keep things semi-concrete, the markup would become

:dc a DataCatalog ;
    dataset :x1, :x2, ... .
:x1 a Dataset ;
    distribution :x1csv, ... .

or

:dc a DataCatalog ;
    dataset :x .
:x hasPart :x1, :x2, ... .
:x1 a Dataset ;
    distribution :x1csv, ... .
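As a rough illustration of the second pattern with distributions attached, the following sketch builds the equivalent JSON-LD in Python. The URLs, names, media types, and the helper functions `download`, `dataset` and `formats_by_part` are hypothetical; the point is only that each part Dataset carries its own `distribution` entries (alternate serialisations of the same data), while the container groups the parts with `hasPart` and has no distributions of its own.

```python
import json


def download(url, media_type):
    """Return a schema.org DataDownload node for one serialisation."""
    return {"@type": "DataDownload", "contentUrl": url, "encodingFormat": media_type}


def dataset(identifier, name, distributions=()):
    """Return a schema.org Dataset node, with distributions if any are given."""
    node = {"@type": "Dataset", "@id": identifier, "name": name}
    if distributions:
        node["distribution"] = list(distributions)
    return node


# Each part has one or more serialisations of the *same* data.
x1 = dataset(
    "https://example.org/x1",
    "Annotation analysis 1",
    [
        download("https://example.org/x1.csv", "text/csv"),
        download("https://example.org/x1.ttl", "text/turtle"),
    ],
)
x2 = dataset(
    "https://example.org/x2",
    "Annotation analysis 2",
    [download("https://example.org/x2.csv", "text/csv")],
)

# The container groups the parts; it is not given distributions here.
container = dataset("https://example.org/x", "All analyses for one sample")
container["hasPart"] = [x1, x2]

catalog = {
    "@context": "https://schema.org/",
    "@type": "DataCatalog",
    "@id": "https://example.org/dc",
    "dataset": [container],
}


def formats_by_part(container_node):
    """Map each part's @id to the media types of its distributions."""
    return {
        part["@id"]: [d["encodingFormat"] for d in part.get("distribution", [])]
        for part in container_node["hasPart"]
    }


print(json.dumps(catalog, indent=2))
```

Calling `formats_by_part(container)` gives a quick view of which serialisations each part offers, which makes it easy to check that distributions are not being used to stand in for different parts.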

AlasdairGray avatar Jun 01 '22 10:06 AlasdairGray