code-research-object (JSON-LD) Metadata for software discovery


Following up on our blog post here, I'd love to hear your thoughts on the idea of using JSON-LD as a lightweight metadata format for describing scientific software.

Some questions to get us started:

Is this a good idea? The goal when writing about this was to identify both the minimal amount of information necessary to cite software (and therefore receive credit as an author) and a format for recording it.
Is it an error to not consider dependency management here? NodeJS uses package.json to define dependencies. I'd rather see software dependency management handled separately (and not by this file).
What's missing? I'd love to expand this keywords block into a subject/domain/software function block - what schemas and ontologies are available for doing this?

{
  "@context": "http://schema.org",
  "@type": "Code",
  "name": "Fidgit",
  "codeRepository": "https://github.com/arfon/fidgit",
  "citation": "http://dx.doi.org/10.6084/m9.figshare.828487",
  "description": "An ungodly union of GitHub and Figshare http://fidgit.arfon.org",
  "dateCreated": "2013-10-19",
  "license": "http://opensource.org/licenses/MIT",
  "author": [
    {
    "@type": "Person",
    "name": "Arfon Smith",
    "@id": "http://orcid.org/0000-0002-3957-2474",
    "email": "[email protected]"
    },
    {
    "@type": "Person",
    "name": "Kaitlin Thaney",
    "@id": "http://orcid.org/0000-0002-7217-4494",
    "email": "[email protected]"
    }
  ],
  "keywords": "publishing, DOI, credit for code"
}
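
As a rough illustration of the citation use case, here is a minimal Python sketch that turns the record above into a plain-text citation. The function name `format_citation` and the citation layout are assumptions made for this example, not an agreed style; only fields present in the record are used.

```python
import json

# The record from the post above, trimmed to the fields used for citation.
RECORD = json.loads("""
{
  "@context": "http://schema.org",
  "@type": "Code",
  "name": "Fidgit",
  "citation": "http://dx.doi.org/10.6084/m9.figshare.828487",
  "dateCreated": "2013-10-19",
  "author": [
    {"@type": "Person", "name": "Arfon Smith"},
    {"@type": "Person", "name": "Kaitlin Thaney"}
  ]
}
""")


def format_citation(record):
    """Build a plain-text citation. The layout is an assumption, not a standard style."""
    authors = record.get("author", [])
    if isinstance(authors, dict):
        # JSON-LD allows a single value where an array is also accepted.
        authors = [authors]
    names = ", ".join(author["name"] for author in authors)
    year = record["dateCreated"][:4]  # ISO 8601 dates start with the year
    return f"{names} ({year}). {record['name']}. {record['citation']}"


print(format_citation(RECORD))
```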

Jun 24 '14 10:06 arfon

Generally I think this looks good.

The citation property doesn't quite fit, as citation is meant to be an object cited by this work, rather than another identifier for this work - identifier would be better, perhaps, or @id?

Could it also have version and datePublished properties, to mark the date a specific version was released (as these are likely to be used in citation of this object)?
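
A small Python sketch of what these two suggestions could look like when applied to the record. The helper name `apply_review_suggestions` is hypothetical, and the version string and release date are placeholder values, not Fidgit's real release data.

```python
import copy


def apply_review_suggestions(record, version, date_published):
    """Return a copy of the record with the DOI moved from `citation`
    to `identifier` and release information added. The input is not modified."""
    revised = copy.deepcopy(record)
    if "citation" in revised:
        revised["identifier"] = revised.pop("citation")
    revised["version"] = version
    revised["datePublished"] = date_published
    return revised


original = {
    "@context": "http://schema.org",
    "@type": "Code",
    "name": "Fidgit",
    "citation": "http://dx.doi.org/10.6084/m9.figshare.828487",
}

# "0.1.0" and "2014-06-24" are placeholders chosen for illustration only.
revised = apply_review_suggestions(original, "0.1.0", "2014-06-24")
print(revised)
```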

Jun 24 '14 11:06 hubgit

SoftwareApplication ("A software application") might also be a better fit for some uses than Code ("Computer programming source code") - I guess it depends on what's being described, though…

Jun 24 '14 11:06 hubgit

It might be useful to look at expanding citeproc-json to support software. See http://blog.martinfenner.org/2013/07/30/citeproc-yaml-for-bibliographies/ and https://github.com/citation-style-language/schema/blob/master/csl-data.json

Jun 24 '14 23:06 egh

Generally, I quite like your approach. I'm a bit on the fence regarding dependency management. Since your main use case seems to be citations, I think dependencies are the most "objective" citations possible and thus shouldn't be ignored. On the other hand, duplicating such information (e.g. in code.jsonld and package.json) and keeping it in sync might be problematic. Thus, it might be worth looking for alternative approaches that allow enriching (i.e. marking up) existing descriptions such as package.json or composer.json.
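
One way to picture the "mark up what already exists" idea is to attach a JSON-LD context to the content of a package.json rather than maintaining a second file. The sketch below is only an illustration: the key-to-term mapping in `PACKAGE_JSON_CONTEXT`, the helper `mark_up`, and the sample package are all assumptions, not an agreed mapping.

```python
import json

# Assumed mapping from package.json keys to schema.org terms (illustrative only).
PACKAGE_JSON_CONTEXT = {
    "name": "http://schema.org/name",
    "description": "http://schema.org/description",
    "version": "http://schema.org/version",
    "license": "http://schema.org/license",
    "keywords": "http://schema.org/keywords",
}


def mark_up(package):
    """Return the package.json content with a JSON-LD context and type attached."""
    doc = {"@context": PACKAGE_JSON_CONTEXT, "@type": "http://schema.org/Code"}
    doc.update(package)
    return doc


# Hypothetical package.json content, used only to exercise the sketch.
package = {
    "name": "example-tool",
    "version": "0.1.0",
    "keywords": ["publishing", "DOI"],
    "dependencies": {"example-dependency": "^1.0.0"},
}

marked = mark_up(package)
print(json.dumps(marked, indent=2))
```

Keys without a term in the context, such as `dependencies`, stay in the file for the package manager but are dropped by JSON-LD processors during expansion, so the dependency information is not duplicated anywhere.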

Btw. are you aware of https://github.com/digitalbazaar/jsonld.js/issues/39?

Jul 01 '14 09:07 lanthaler

I added a code.jsonld file to my repository: https://github.com/gramian/emgr/blob/master/code.jsonld . +1 for using keys from SoftwareApplication, especially softwareVersion.

Jul 07 '14 10:07 gramian

@egh re: research into citeproc-js, Zotero, RDFa, CiteProc : https://forums.zotero.org/discussion/35992/export-to-schemaorg-rdfa-andor-microdata/

Jul 07 '14 22:07 westurner

@arfon The SEON ontologies model much of this domain: http://www.se-on.org/ (I'm not yet aware of any mappings to schema.org).

http://www.reddit.com/r/MachineLearning/comments/1dycyd/why_not_model_computer_programs_its_just_as_fun/c9vc0qa

'ircChannel' is one property that may be worth championing:

http://www.reddit.com/r/javascript/comments/1fq6gj/use_irc_for_your_github_project_make_it_easier/cacxbxe

Jul 07 '14 22:07 westurner

Other overlapping approaches to consider either aligning to or working with:

Python's implementation of the Trove classifiers: http://legacy.python.org/dev/peps/pep-0301/#distutils-trove-classification

Debian's Upstream Metadata format: https://wiki.debian.org/UpstreamMetadata#Fields

Oct 06 '14 15:10 mr-c

Also see "Implementing Transitive Credit with JSON-LD": http://arxiv.org/abs/1407.5117

Oct 06 '14 16:10 danielskatz

Also, an update on this work just posted this week by @acabunoc : http://mozillascience.org/code-as-as-research-object-new-phase/

Oct 07 '14 16:10 kaythaney

Relevant to this discussion too: http://softwarediscoveryindex.org/

Oct 08 '14 11:10 brainstorm

Python's implementation of the Trove classifiers: http://legacy.python.org/dev/peps/pep-0301/#distutils-trove-classification

Notes from http://lists.w3.org/Archives/Public/public-vocabs/2014Oct/0018.html:

[...] There are structured fields for Python Packaging metadata [2][3] and there are tables in warehouse [4]. [...]

.

Debian's Upstream Metadata format: https://wiki.debian.org/UpstreamMetadata#Fields

http://packages.qa.debian.org/p/python2.7.ttl
http://packages.qa.debian.org/p/python2.7.rdf

Oct 08 '14 11:10 westurner

Other relevant fields:

  "dateModified": "schemaorg:dateModified",
  "datePublished": "schemaorg:datePublished"

This brings up a larger issue: some of these properties may need to be arrays, particularly citation (parts of the work can be cited in multiple papers) and repositories. We've already seen a lot of projects move from local svn environments at universities, to SourceForge, to GitHub. It's important to make sure that we have a way of recording these changes of source location. An array like so:

{
  "codeRepository": [{
    "dateCreated": "schemaorg:dateCreated",
    "src": "git@github.com:mozillascience/code-research-object",
    "url": "https://github.com/mozillascience/code-research-object",
    "versioningSoftware": ["git", "svn", ...]
  },
  {
    "dateCreated": ...
  }]
}

Oct 08 '14 17:10 RichardLitt

@RichardLitt

http://www.w3.org/TR/json-ld-syntax/#terminology :

array An array structure is represented as square brackets surrounding zero or more values. Values are separated by commas. In JSON, an array is an ordered sequence of zero or more values. While JSON-LD uses the same array representation as JSON, the collection is unordered by default. While order is preserved in regular JSON arrays, it is not in regular JSON-LD arrays unless specifically defined (see section 6.11 Sets and Lists).

... @list (or @container in the context): http://www.w3.org/TR/json-ld-syntax/#sets-and-lists
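
To make the ordering point concrete, here is a minimal Python sketch that builds a document whose `codeRepository` term is declared as an ordered list in the context. The use of `@vocab` and the first repository URL are assumptions for illustration; only the GitHub URL comes from this thread.

```python
import json

doc = {
    "@context": {
        "@vocab": "http://schema.org/",
        # Declaring the container as @list asks JSON-LD processors to preserve order.
        "codeRepository": {
            "@id": "http://schema.org/codeRepository",
            "@type": "@id",
            "@container": "@list",
        },
    },
    "@type": "Code",
    "name": "code-research-object",
    # Oldest location first; the svn URL is a hypothetical earlier home.
    "codeRepository": [
        "https://example.org/svn/code-research-object",
        "https://github.com/mozillascience/code-research-object",
    ],
}

print(json.dumps(doc, indent=2))
```

Without the `@container` declaration the same array would be treated as an unordered set, so "the last entry is the current repository" could not be relied on.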

Oct 12 '14 00:10 westurner

@westurner Good point. Perhaps startDate, endDate, instead? Or an active field.

Looking at my code block above, I meant to have the value for codeRepository be an array of objects. I must have been tired; sorry about that.
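
A sketch of how the startDate / endDate idea might be consumed, written in Python. The helper name `current_repository`, the dates, and the svn URL are hypothetical; the point is only that explicit dates let a reader find the active repository without depending on array order.

```python
def current_repository(repositories):
    """Return the entry with no endDate and the latest startDate, or None."""
    active = [repo for repo in repositories if "endDate" not in repo]
    if not active:
        return None
    # ISO 8601 date strings sort chronologically when compared as text.
    return max(active, key=lambda repo: repo["startDate"])


# Hypothetical history; order is deliberately not chronological.
repositories = [
    {
        "url": "https://github.com/mozillascience/code-research-object",
        "startDate": "2014-01-01",
    },
    {
        "url": "https://example.org/svn/code-research-object",
        "startDate": "2010-01-01",
        "endDate": "2013-12-31",
    },
]

print(current_repository(repositories)["url"])
```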

Oct 15 '14 01:10 RichardLitt

Sounds like a good solution to me.

Oct 15 '14 16:10 westurner

May be OT, but TIL about the AppStream software package metadata interoperability spec:

https://en.wikipedia.org/wiki/AppStream
http://www.freedesktop.org/wiki/Distributions/AppStream/
http://www.freedesktop.org/software/appstream/docs/

Oct 23 '14 00:10 westurner

@arfon wrote:

I'd love to hear your thoughts on the idea of using JSON-LD as a lightweight metadata format for describing scientific software.

Curious if you're targeting standalone scientific software or software used in the context of research. If the latter, it would be a shame to not include/adopt any of the existing metadata conventions for describing the underlying data that the software is designed to consume.

Relevant discussion of the potential of JSON-LD as a format for dataset metadata at dataprotocols/dataprotocols#110.

Oct 28 '14 20:10 joyrexus

Here is a use case this needs to pass to be useful for processing:

Given LD info on some web page, provide an "Edit this page on GitHub" link. For example, http://trafficserver.readthedocs.org/en/latest/
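
A minimal Python sketch of that use case, assuming the page's metadata exposes a GitHub `codeRepository` URL and that the source path and branch of the page are known by some other means. The function name `edit_link`, the default branch, and the example path are assumptions.

```python
from urllib.parse import urlparse


def edit_link(code_repository, path, branch="master"):
    """Build a GitHub edit URL for a file, or return None for non-GitHub repositories."""
    parsed = urlparse(code_repository)
    if parsed.netloc != "github.com":
        return None
    owner_repo = parsed.path.strip("/")
    if owner_repo.endswith(".git"):
        owner_repo = owner_repo[: -len(".git")]
    return f"https://github.com/{owner_repo}/edit/{branch}/{path.lstrip('/')}"


# "README.md" is a hypothetical path used only for illustration.
print(edit_link("https://github.com/arfon/fidgit", "README.md"))
```

The metadata as proposed does not say which file in the repository produced a given page, so a real implementation would need that mapping from somewhere else.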

Nov 19 '14 17:11 techtonik

Worth mentioning: the European Commission's Asset Description Metadata Schema for Software (ADMS.SW), and the discussions in:

https://github.com/codemeta/codemeta/issues/41
https://github.com/WhiteHouse/source-code-policy/issues/117

Jun 30 '16 09:06 ceefour
