SPDX identifiers for licenses?
This thread tries to gather existing best practice, or to help establish one where none exists, and to hear other views.
license property vs SPDX identifier
https://schema.org/license refers to a CreativeWork or URL and is of course useful particularly on all kinds of https://schema.org/CreativeWork beyond documents, e.g. https://schema.org/SoftwareSourceCode and https://schema.org/ImageObject
It is now common best practice in open source software to use SPDX ids (https://spdx.dev/ids/) for identifying the license of source code. You may have come across code comments like:
# SPDX-License-Identifier: GPL-2.0-or-later
But http://schema.org/license requires a URL or CreativeWork - so which one to use? And can we classify these with SPDX identifiers even if a specialized license file (with copyright) is linked to? How do we deal with dual licensing?
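As a starting point for tooling, here is a minimal Python sketch of how such a comment could be picked out of a source file. The function name `find_spdx_expression` is made up for illustration, and the regex is a simplification: it assumes the tag and the whole expression sit on one line.

```python
import re

# Matches the tag anywhere in a line, whatever the comment prefix (#, //, /* ...).
SPDX_TAG = re.compile(r"SPDX-License-Identifier:\s*(.+)")


def find_spdx_expression(source_text):
    """Return the first SPDX expression declared in source_text, or None."""
    for line in source_text.splitlines():
        match = SPDX_TAG.search(line)
        if match:
            expression = match.group(1).strip()
            # Drop a closing block-comment marker such as "*/", if present.
            if expression.endswith("*/"):
                expression = expression[:-2].strip()
            return expression
    return None


print(find_spdx_expression("# SPDX-License-Identifier: GPL-2.0-or-later\nimport os\n"))
```

Whatever string comes back may be a single identifier or a full expression, which is exactly why mapping it to a single license URL is not always possible.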
SPDX intro
https://spdx.org/licenses/ lists known open source licenses. These identifiers are great as they avoid confusion such as "What do you mean 'BSD license': 2-clause, 3-clause or 4-clause?" - the unambiguous BSD-3-Clause can be looked up at https://spdx.org/licenses/BSD-3-Clause
SPDX has known licenses expressed as RDF like (simplified):
<http://spdx.org/licenses/GPL-2.0-or-later>
a spdx:License ;
rdfs:comment "This license was released: June 1991. This license identifier refers to the choice to use code under GPL-2.0-or-later (i.e., GPL-2.0 or some later version), as distinguished from use of code under GPL-2.0-only. The license notice (as seen in the Standard License Header field below) states which of these applies the code in the file. The example in the exhibit to the license shows the license notice for the \"or later\" approach." ;
rdfs:seeAlso "https://www.gnu.org/licenses/old-licenses/gpl-2.0-standalone.html" , "https://opensource.org/licenses/GPL-2.0" ;
spdx:isFsfLibre "true" ;
spdx:isOsiApproved "true" ;
spdx:licenseId "GPL-2.0-or-later" ;
spdx:name "GNU General Public License v2.0 or later" .
(This RDF seems to exist only on GitHub; although some microdata is embedded, it gets the subject wrong.)
Using SPDX URIs as @id
So the simple approach, shown in schemaorg/schemaorg#1928, is to just use these URIs, like http://spdx.org/licenses/GPL-2.0-or-later, directly. @njh in https://www.arduinolibraries.info/libraries/arduino-json.json has opted for the https instead of the http variant:
{
"@context": "http://schema.org/",
"@type": "SoftwareApplication",
"name": "ArduinoJson",
"url": "https://arduinojson.org/?utm_source=meta&utm_medium=library.properties",
"author": {
"@type": "Person",
"name": "Benoit Blanchon"
},
"license": "https://spdx.org/licenses/MIT"
}
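For anyone generating this kind of JSON-LD, the simple approach could be sketched in Python as below. The helper names are hypothetical, and the choice of the https base URI simply follows the example above; it is not an official recommendation.

```python
import json

# Assumption: the https variant, as used in the ArduinoJson example.
SPDX_BASE = "https://spdx.org/licenses/"


def spdx_license_uri(license_id):
    """Build the SPDX URI for a single license id such as 'MIT'."""
    # An expression like "MIT OR Apache-2.0" has no single URI, so reject it.
    if not license_id or any(ch.isspace() or ch in "()" for ch in license_id):
        raise ValueError(f"not a single SPDX license id: {license_id!r}")
    return SPDX_BASE + license_id


def software_jsonld(name, license_id):
    """Describe a piece of software with its license given as an SPDX URI."""
    return {
        "@context": "http://schema.org/",
        "@type": "SoftwareApplication",
        "name": name,
        "license": spdx_license_uri(license_id),
    }


print(json.dumps(software_jsonld("ArduinoJson", "MIT"), indent=2))
```

Note that this only works for single identifiers; license expressions need one of the richer forms discussed further down.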
Many URIs
Many of the licenses have their own URIs as well, and on top of that come the usual http vs https variations, so there are many potential inconsistencies:
- https://opensource.org/licenses/Apache-2.0
- https://spdx.org/licenses/Apache-2.0 (as used in SPDX RDF)
- https://spdx.org/licenses/Apache-2.0.html (as linked to from https://spdx.org/licenses/)
- http://www.apache.org/licenses/LICENSE-2.0 (as linked to from itself and http://www.apache.org/licenses/)
- https://www.apache.org/licenses/LICENSE-2.0 (because redirect to https is scary)
- https://www.apache.org/licenses/LICENSE-2.0.html (...)
- https://www.apache.org/licenses/LICENSE-2.0.txt (this one you can download to make your LICENSE file)
- https://identifiers.org/spdx:Apache-2.0 (yes)
- https://identifiers.org/spdx/Apache-2.0 (in case : is scary)
For listing/mapping https://opendefinition.org/licenses/api/ has a nice list, but it's custom JSON.
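A consumer that wants to treat these variants as the same license needs some normalization. The following Python sketch maps the URLs listed above back to an SPDX id. The function name `to_spdx_id` is hypothetical and the alias table is an illustrative assumption covering only the Apache-2.0 URLs from this list; a real mapping would need a maintained source.

```python
from urllib.parse import urlsplit

# Illustrative alias table (host + path, without scheme or file suffix).
ALIASES = {
    "opensource.org/licenses/Apache-2.0": "Apache-2.0",
    "www.apache.org/licenses/LICENSE-2.0": "Apache-2.0",
}


def to_spdx_id(url):
    """Return the SPDX id a license URL refers to, or None if unknown."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path
    # Ignore the .html / .txt renderings of the same license page.
    for suffix in (".html", ".txt"):
        if path.endswith(suffix):
            path = path[: -len(suffix)]
    path = path.rstrip("/")

    if host == "spdx.org" and path.startswith("/licenses/"):
        return path[len("/licenses/"):]
    if host == "identifiers.org":
        rest = path.lstrip("/")
        for prefix in ("spdx:", "spdx/"):
            if rest.startswith(prefix):
                return rest[len(prefix):]
    # http and https collapse to the same key because the scheme is dropped.
    return ALIASES.get(host + path)


print(to_spdx_id("https://www.apache.org/licenses/LICENSE-2.0.txt"))
```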
Challenges
The SPDX website is inconsistent with its own RDF: https://spdx.org/licenses/ links to https://spdx.org/licenses/MIT.html (notice https and .html), so I guess many will pick up the alternative URIs. I have also seen the variant @njh uses as the most common, e.g. we refer to it from https://www.commonwl.org/user_guide/17-metadata/index.html
SPDX identifiers do not just identify a single license; they can also be expressions covering dual licenses, like MIT OR Apache-2.0, or exceptions. Some licenses like https://spdx.org/licenses/BSD-3-Clause are templates requiring a copyright year and copyright holder, so the actual license URL would be a specialized file, say https://github.com/seek4science/seek/blob/master/BSD-LICENSE, which would then not be immediately recognizable as the BSD 3-Clause license.
Using identifier from CreativeWork
One way around this could be to use http://schema.org/identifier on an anonymous or local CreativeWork license resource. Setting the SPDX expression directly as the identifier would of course be easiest, but it leaves a bit too much implied:
{ "@id": "workflow.cwl",
"@type": "SoftwareSourceCode",
"license": {
"@id": "https://creativecommons.org/licenses/by/4.0/",
"@type": "CreativeWork",
"name": "CC BY 4.0",
"description": "Creative Commons Attribution 4.0 International License",
"identifier": "CC-BY-4.0"
}
}
Using PropertyValue to capture SPDX expressions
More explicitly, using http://schema.org/PropertyValue identifiers we can better include SPDX expressions, even if there is no license file or it is a local specialization:
{ "@id": "dual-licensed.py",
"@type": "SoftwareSourceCode",
"license": {
"@type": "CreativeWork",
"name": "MIT or AGPL 3.0 (or later)",
"description": "Dual-licensed as MIT or AGPL 3.0",
"isBasedOn": [
"https://spdx.org/licenses/MIT",
"https://spdx.org/licenses/AGPL-3.0-or-later"
],
"identifier": {
"@type": "PropertyValue",
"name": "SPDX-License-Identifier",
"value": "MIT OR AGPL-3.0+",
"propertyID": "https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/"
}
}
}
We see that the SPDX expression MIT OR AGPL-3.0+ is captured. I threw in http://schema.org/isBasedOn for good measure, although this would play double-duty with the SPDX license expression without its flexibility or rigidity.
Here I used https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/ as the https://schema.org/propertyID as it explains the SPDX expressions well, and instead of just SPDX I used SPDX-License-Identifier as the name to match what they recommend for code comments. (Not sure if propertyID here should be {"@id": "https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files"} instead.)
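To show how the PropertyValue form could be generated from an expression, here is a rough Python sketch. The helper names `license_ids` and `license_entity` are made up, and the tokenizer is deliberately naive: it assumes uppercase AND/OR/WITH operators, skips LicenseRef-style ids, and drops a trailing + to get a base id, which is a simplification and not a full SPDX expression parser.

```python
import re

SPDX_BASE = "https://spdx.org/licenses/"
PROPERTY_ID = (
    "https://spdx.github.io/spdx-spec/"
    "appendix-V-using-SPDX-short-identifiers-in-source-files/"
)


def license_ids(expression):
    """List the license ids named in a simple SPDX expression, in order."""
    ids = []
    skip_next = False
    for token in re.findall(r"[^\s()]+", expression):
        if skip_next:
            # The token after WITH names an exception, not a license.
            skip_next = False
        elif token == "WITH":
            skip_next = True
        elif token in ("AND", "OR"):
            continue
        elif token.startswith(("LicenseRef-", "DocumentRef-")):
            continue  # user-defined references have no spdx.org URI
        else:
            base = token.rstrip("+")  # simplification for the legacy "+" suffix
            if base not in ids:
                ids.append(base)
    return ids


def license_entity(expression, name):
    """Build a CreativeWork license carrying the expression as PropertyValue."""
    return {
        "@type": "CreativeWork",
        "name": name,
        "isBasedOn": [SPDX_BASE + i for i in license_ids(expression)],
        "identifier": {
            "@type": "PropertyValue",
            "name": "SPDX-License-Identifier",
            "value": expression,
            "propertyID": PROPERTY_ID,
        },
    }


entity = license_entity("MIT OR AGPL-3.0-or-later", "MIT or AGPL 3.0 (or later)")
print(entity["isBasedOn"])
```

The expression is kept verbatim in value, so isBasedOn stays a convenience listing and the expression remains the authoritative statement.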
This is much more precise - but unfortunately becomes a bit too nested/repetitive when applied to the base case of just using https://spdx.org/licenses/MIT style URIs directly:
{
"@context": "http://schema.org/",
"@type": "SoftwareApplication",
"name": "ArduinoJson",
"license": {
"@id": "https://spdx.org/licenses/MIT",
"@type": "CreativeWork",
"name": "MIT",
"identifier": {
"@type": "PropertyValue",
"name": "SPDX-License-Identifier",
"value": "MIT",
"propertyID": "https://spdx.github.io/spdx-spec/appendix-V-using-SPDX-short-identifiers-in-source-files/"
}
}
}
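Since all three shapes (plain URI, CreativeWork with a string identifier, CreativeWork with a PropertyValue) may turn up in the wild, a consumer would probably have to accept each of them. A small Python sketch of that, with the hypothetical name `spdx_from_license`, and assuming that a bare string identifier on the license is meant as an SPDX id:

```python
SPDX_PREFIXES = ("https://spdx.org/licenses/", "http://spdx.org/licenses/")


def _id_from_uri(uri):
    """Extract the SPDX id from a spdx.org license URI, or return None."""
    for prefix in SPDX_PREFIXES:
        if uri.startswith(prefix):
            rest = uri[len(prefix):]
            return rest[:-5] if rest.endswith(".html") else rest
    return None


def spdx_from_license(license_value):
    """Return the SPDX id or expression from a schema.org license value."""
    if isinstance(license_value, str):
        return _id_from_uri(license_value)
    if isinstance(license_value, dict):
        identifier = license_value.get("identifier")
        if isinstance(identifier, dict):
            # PropertyValue form: trust it only if it is labelled as SPDX.
            if identifier.get("name") == "SPDX-License-Identifier":
                return identifier.get("value")
        elif isinstance(identifier, str):
            # Assumption: a bare string identifier is an SPDX id.
            return identifier
        return _id_from_uri(license_value.get("@id", ""))
    return None


print(spdx_from_license("https://spdx.org/licenses/MIT.html"))
```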
Discussion across GitHub
(This section added to lure others in to comment with their views :grin: )
In schemaorg/schemaorg#1928 @njh settles on using https://spdx.org/licenses/MIT directly as @id
In seek4science/seek#456 we tried to explore this further, as we had initially abused license as a text field with an implied SPDX identifier looked up using https://opendefinition.org/ JSON - we need to distinguish between "data license" and "software license". It suggests the PropertyValue expanded form shown above. Discussions include @fbacall @stuzart @alaninmcr
In radiantearth/stac-spec#378 @mojodna @gkellogg @m-mohr are using the variant https://spdx.org/licenses/MIT.html in JSON-LD
In galaxyproject/galaxy#10408 @jmchilton and @nsoranzo are referencing SPDX from Galaxy workflows, unclear which identifier form (custom YAML?)
In earthcubearchitecture-project418/p418Docs#6 @mbjones suggests (https://github.com/earthcubearchitecture-project418/p418Docs/issues/6#issuecomment-358169081) a PropertyValue approach as above, but less verbose, with the string SPDX as propertyID, since https://schema.org/propertyID can be either Text or URL.
The Citation File Format (CFF) (custom YAML) uses license_url: https://spdx.org/licenses/MIT and license: "MIT" - see for instance citation-file-format/cff-converter-python#25 by @jspaaks and citation-file-format/citation-file-format#105 with @thomaskrause
In the https://science-on-schema.org guidelines for Dataset metadata, we recommend using SPDX URIs from the RDF files: https://github.com/ESIPFed/science-on-schema.org/blob/master/guides/Dataset.md#license
In CodeMeta, which is a schema.org extension for software metadata, we also recommend using SPDX: https://github.com/codemeta/codemeta/issues/67 although the guidelines are not prescriptive.
Some quick thoughts:
- We can surely adopt a different variation of the SPDX URIs. We only generate JSON-LD on the fly from the STAC metadata files, which are JSON only. We are the only ones to append ".html" to the URL, but can surely remove that to align with others.
- We found that SPDX for data is not very suitable in many cases. There are a couple of data-related licenses missing and many licenses are actually custom/proprietary licenses (although some of the data sets are free), so we went for an additional allowed value "proprietary" (also not ideal), which then adds a link to the actual license.
Just a note from Codemetapy https://github.com/proycon/codemetapy :
"For schema:license, full SPDX URIs are used where possible."