Making GSC repo FAIRer
As a good open source community, we should be aiming to be FAIR. In the evolving world of FAIR software, metadata about the code is becoming a focal point. There is a movement, led by the Software Heritage archive, called CodeMeta: https://codemeta.github.io/
Essentially, it's a JSON-LD file that can be included with the code (in our case, on GitHub) to describe the code in machine-readable metadata. They have even created a simple "generator" tool to help people create the JSON-LD file: https://codemeta.github.io/codemeta-generator/
I started creating one using the form, and below is what I ended up with. It definitely needs more of the authors and contributors added, as well as more of the "run-time environment" details (languages, etc.):
{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "license": "https://spdx.org/licenses/CC0-1.0",
  "codeRepository": "https://github.com/GenomicsStandardsConsortium/mixs/",
  "dateModified": "2023-04-01",
  "downloadUrl": "https://github.com/GenomicsStandardsConsortium/mixs/releases/tag/mixs6.1.0",
  "name": "GSC-MIXS",
  "version": "6.1",
  "identifier": "https://github.com/GenomicsStandardsConsortium/mixs/releases/tag/mixs6.1.0",
  "description": "The Genomics Standards Consortium maintain the Minimum Information about any(x) sequence (MIxS) checklists. This code includes the source of truth of the current checklists as well as the tools to provide those checklists in multiple formats. It is envisaged that in the future, tools to validate checklists will also be included.",
  "applicationCategory": "checklist",
  "isPartOf": "https://gensc.org",
  "keywords": [
    "genomics",
    "checklists",
    "models",
    "ontologies",
    "data-sharing"
  ],
  "programmingLanguage": [
    "link-ml"
  ],
  "author": [
    {
      "@type": "Person",
      "@id": "https://orcid.org/0000-0001-8815-0078",
      "givenName": "Ramona",
      "familyName": "Walls",
      "email": "[email protected]"
    }
  ],
  "contributor": [
    {
      "@type": "Person",
      "@id": "https://orcid.org/0000-0002-1335-0881",
      "givenName": "Christopher",
      "familyName": "Hunter",
      "email": "[email protected]"
    }
  ]
}
LinkML generates jsonschema-ld
true, but I assume it doesn't produce CodeMeta JSON-LD.
good point
maybe a conversion from pyproject.toml?
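To make the pyproject.toml idea a bit more concrete, here is a rough sketch of what such a conversion might look like. It assumes the metadata sits in a PEP 621 `[project]` table (a Poetry-style `[tool.poetry]` table uses different keys and would need its own mapping), and that the repository link is stored under a `Repository` label in `[project.urls]`, which is only a common convention. The function name and the example values are hypothetical; this is not an existing CodeMeta or LinkML tool.

```python
import json

CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"


def pyproject_to_codemeta(project):
    """Map an already-parsed PEP 621 [project] table (a dict) to CodeMeta terms."""
    codemeta = {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
        "name": project["name"],
    }
    # Copy simple string fields only when they are present.
    for key in ("version", "description"):
        if key in project:
            codemeta[key] = project[key]
    if project.get("keywords"):
        codemeta["keywords"] = list(project["keywords"])
    # PEP 621 authors are tables with optional "name" and "email" keys.
    authors = []
    for entry in project.get("authors", []):
        person = {"@type": "Person"}
        for field in ("name", "email"):
            if field in entry:
                person[field] = entry[field]
        authors.append(person)
    if authors:
        codemeta["author"] = authors
    # Assumption: the repo URL is labelled "Repository" in [project.urls].
    repo = project.get("urls", {}).get("Repository")
    if repo:
        codemeta["codeRepository"] = repo
    return codemeta


# Hypothetical input; with Python 3.11+ a real file could be read via
#   import tomllib
#   with open("pyproject.toml", "rb") as fh:
#       project = tomllib.load(fh)["project"]
example_project = {
    "name": "GSC-MIXS",
    "version": "6.1",
    "keywords": ["genomics", "checklists"],
    "urls": {"Repository": "https://github.com/GenomicsStandardsConsortium/mixs/"},
}
print(json.dumps(pyproject_to_codemeta(example_project), indent=2))
```

Fields that pyproject.toml does not carry (ORCID `@id`s, `contributor`, `applicationCategory`) would still have to be added by hand or merged in from a maintained file.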
Some of the approaches we use for ODIS may be useful here. Here's a mix of those and some thoughts from my side.
- +1 for encoding metadata about GSC and its activities in JSON-LD
- The semantics we embed in the JSON-LD are key:
  - we should avoid using @context values that are essentially siloed, very domain-specific, or too academic/research-oriented
  - we should use something like vanilla schema.org for @context as much as we can, since much of the web speaks it and it boosts our discoverability
- If we encode metadata about GSC as an organisation, as well as its projects, code repos, software, etc using vanilla schema.org we can have more direct impact
- Once we have the JSON-LD, we would probably need to include a robots.txt and a sitemap.xml pointing crawlers to the JSON-LD assets. More documentation here
- We can leverage the specification work we've been doing for ODIS (some in development) for
If we do something like this, then the various indexing services and their bots will be able to discover GSC metadata more effectively, making it FAIR at scale. We can also harvest this into IOC-UNESCO ODIS to dovetail with the data feeds from/for the UN Ocean Decade Programme, the Ocean Biomolecular Observing Network (https://github.com/iodepo/odis-arch/issues/146).
This will also dovetail with the GSC MIOP project via BeBOP, which has a sitemap-based ODIS interface and shares metadata about omic protocols in ways compatible with ODIS (JSON-LD + schema.org). This was tested during the EU Horizon project TechOceanS here, with ODIS metadata that the sitemap points to here.
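As a rough illustration of the robots.txt and sitemap.xml step, the sketch below builds a minimal sitemap listing JSON-LD assets, plus a robots.txt that points crawlers at it. The function names and the example.org URLs are placeholders, not actual GSC or ODIS locations, and ODIS may expect additional conventions that are not shown here.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, lastmod=None):
    """Return a sitemap.xml string listing each URL, with an optional lastmod date."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        if lastmod:
            ET.SubElement(entry, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    # Prepend the XML declaration by hand so the declared encoding is fixed.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def robots_txt(sitemap_url):
    """Return a permissive robots.txt that advertises the sitemap location."""
    return "User-agent: *\nAllow: /\nSitemap: " + sitemap_url + "\n"


# Placeholder URLs for illustration only.
print(build_sitemap(["https://example.org/metadata/mixs.jsonld"], lastmod="2023-04-01"))
print(robots_txt("https://example.org/sitemap.xml"))
```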
I suppose the JSON-LD would live on the GSC's website somewhere, perhaps embedded in pages or just in a file store.
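If the JSON-LD ends up embedded in pages, one possible shape is sketched below: a plain schema.org description of the repository wrapped in a `<script type="application/ld+json">` tag. The values are copied from the CodeMeta draft earlier in this thread; the choice of `publisher` for linking back to GSC and the helper name are my assumptions, not an agreed profile.

```python
import json


def to_script_tag(doc):
    """Serialise a JSON-LD document into an HTML script tag for embedding in a page."""
    # Escape "</" so a string value can never close the script element early;
    # "\/" is a valid JSON escape, so the payload still parses to the same data.
    payload = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Values taken from the CodeMeta draft above; the structure itself is an assumption.
mixs_doc = {
    "@context": "https://schema.org/",
    "@type": "SoftwareSourceCode",
    "name": "GSC-MIXS",
    "version": "6.1",
    "license": "https://spdx.org/licenses/CC0-1.0",
    "codeRepository": "https://github.com/GenomicsStandardsConsortium/mixs/",
    "publisher": {
        "@type": "Organization",
        "name": "GSC",
        "url": "https://gensc.org",
    },
}
print(to_script_tag(mixs_doc))
```

The same dictionary could just as well be written out as a standalone .jsonld file and listed in the sitemap instead.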
Further to this, there is now a CodeFair tool (https://codefair.io/) that can be integrated with a GitHub repo to help make it FAIR-compliant. (I've not tried it; I only just found out about it!)