
FAIR Protocol Buffer?

Open krobasky opened this issue 6 years ago • 5 comments

I see this repo is under 'fair-research' - has anybody started on defining a FAIR protocol buffer?

krobasky avatar Mar 30 '18 23:03 krobasky

Apologies, but it is not clear to me what "FAIR protocol buffer" is supposed to mean in the context of the bdbag software. Would it be possible for you to provide some more detail or reference material?

mikedarcy avatar Apr 03 '18 18:04 mikedarcy

Hi Mike - perhaps my question is misplaced. It concerns the metadata a bdbag would need to carry in order to be FAIR: provenance, a unique identifier, keywords, licensing, that sort of thing. Thoughts?

krobasky avatar Apr 03 '18 21:04 krobasky

The (BD)Bag specification describes a container: it is silent on many of the issues raised in the FAIR principles, like data licenses and vocabularies. However, the metadata directory provides a natural place to address those issues. We can, for example, include Research Object (RO) metadata: see https://github.com/fair-research/bdbag/blob/master/profiles/bdbag-ro-profile.json. (See https://n2t.net/minid:b9dt2t for an example of a BDBag that includes simple RO metadata.)

As Carl Kesselman noted in a recent email exchange, one could address the licensing issue, for example, by:

  1. Adding the actual license text as an asset in the BDBag and making it accessible either in the data directory or via fetch.txt.
  2. Using the key/value metadata in the BDBag to associate a license URI or PID with the bag. We could easily extend the BDBag profile to include this. Extending the key/value metadata is a standard part of the BagIt spec, so this is totally acceptable.
  3. Specifying the license as additional research object metadata that you associate with an asset (i.e. file) along with the other file-specific attributes, such as the file type from OBI.

If such conventions are defined, we can integrate them into the BDBag tools.
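
As a rough illustration of option 2 above, the sketch below appends a license tag to a bag's bag-info.txt using only the Python standard library. The tag name `License-URI` and both helper functions are hypothetical: nothing in this thread, the BagIt spec, or the BDBag profile defines them, and this is not the bdbag tooling's own API.

```python
import os
import tempfile

# Hypothetical tag name, chosen for illustration only.
LICENSE_TAG = "License-URI"


def add_bag_info_tag(bag_dir, key, value):
    """Append one 'Key: value' line to bag-info.txt in bag_dir."""
    path = os.path.join(bag_dir, "bag-info.txt")
    existing = ""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            existing = f.read()
    # Make sure the new tag starts on its own line.
    if existing and not existing.endswith("\n"):
        existing += "\n"
    with open(path, "w", encoding="utf-8") as f:
        f.write(existing + f"{key}: {value}\n")


def read_bag_info(bag_dir):
    """Parse bag-info.txt into a dict (skips continuation lines; last value wins)."""
    tags = {}
    path = os.path.join(bag_dir, "bag-info.txt")
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Skip blank lines and indented continuation lines.
            if line[:1].isspace() or ":" not in line:
                continue
            key, _, value = line.partition(":")
            tags[key.strip()] = value.strip()
    return tags


with tempfile.TemporaryDirectory() as demo_bag:
    add_bag_info_tag(demo_bag, LICENSE_TAG,
                     "http://www.apache.org/licenses/LICENSE-2.0")
    print(read_bag_info(demo_bag))
```

Because bag-info.txt is a tag file, any existing tag manifest checksums would have to be regenerated after an edit like this.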

ianfoster avatar Apr 03 '18 21:04 ianfoster

A student and I have been reviewing various community FAIR efforts and mapping them to requirements for a simple metadata model. We considered ambitious, rigorous efforts such as DATS and HCLS, and decided to start with a more rudimentary, well-scoped set of requirements that are computable but decoupled from implementation. For example, we took into account the convention you describe for licensing, and we also account for versioning of objects, APIs, and even IDs (for example, AAC53040 is the accession ID for the p53 protein sequence object, and its most recent version is AAC53040.1). What is the best format for sharing these conventions for your consideration and feedback? Would a protocol buffer be a suitable format, or JSON, or...?
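
To make the versioned-ID example concrete, here is a minimal sketch that splits an identifier such as AAC53040.1 into its accession and version. The function name and the "trailing dot plus digits" rule are assumptions made for illustration, not part of any convention proposed in this thread.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    Returns (identifier, None) when there is no numeric version suffix.
    """
    base, sep, version = identifier.rpartition(".")
    if base and sep and version.isdigit():
        return base, int(version)
    return identifier, None


print(split_accession("AAC53040.1"))
print(split_accession("AAC53040"))
```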

krobasky avatar Apr 04 '18 02:04 krobasky

I agree that more needs to be done to expand the FAIR metadata needed.

Many of those requirements are covered by the underlying specs; for instance, Research Object Bundle manifests list basic provenance per resource. BDBags support RO manifests through the bdbag_ro.py module.

I will admit that license was not listed there. In theory we can use the dct:license property (from Dublin Core Terms) in metadata/manifest.json, which lets you assign a license per aggregated file. It is not directly listed in the RO spec, however, so it would be a JSON-LD extension that would have to be added manually via bdbag_ro.py. For instance:

"aggregates": [
  {  "uri": "../data/file.txt",
     "dct:license": {
       "uri": "http://www.apache.org/licenses/LICENSE-2.0",
       "name": "Apache License, Version 2.0" 
     }
  }
]

But this should probably be fed upstream for inclusion in a general Research Object profile of FAIR metadata attributes.

There is also schema.org/license, as used by, for instance, BioSchemas Dataset.
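
The per-file dct:license extension shown in the JSON fragment above could be applied to a manifest with something like the following. This is a sketch using plain dictionaries and the standard json module; `set_file_license` is a hypothetical helper, not a function of bdbag_ro.py, and it assumes the manifest's JSON-LD context resolves the `dct` prefix to Dublin Core Terms.

```python
import json


def set_file_license(manifest, file_uri, license_uri, license_name):
    """Attach a dct:license object to the aggregate whose uri is file_uri.

    Creates the aggregate entry if the manifest does not list it yet.
    """
    aggregates = manifest.setdefault("aggregates", [])
    for entry in aggregates:
        if entry.get("uri") == file_uri:
            break
    else:
        # No matching aggregate found: add a new one.
        entry = {"uri": file_uri}
        aggregates.append(entry)
    entry["dct:license"] = {"uri": license_uri, "name": license_name}
    return manifest


manifest = {"aggregates": [{"uri": "../data/file.txt"}]}
set_file_license(
    manifest,
    "../data/file.txt",
    "http://www.apache.org/licenses/LICENSE-2.0",
    "Apache License, Version 2.0",
)
print(json.dumps(manifest, indent=2))
```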

stain avatar Apr 23 '18 13:04 stain