Add `--json` flag to lint
All the infrastructure for this is already in place, but it could be really helpful for people who want to programmatically automate stuff: a `reuse json` interface. It'd simply output a combination of `reuse lint` and `reuse spdx`, but in JSON format.
I've had a longer think about this.
reuse has an internal representation for its data. This data is dictionary-like, and can very easily be transformed into JSON. However, if the internals were somehow changed, so would the output. This is not ideal as an outward API. This could be circumvented by implementing a bridge between the new internals and the formerly-internal API, but that's a bit of a stupid bridge:
- The internal structure probably isn't the best way to represent the data anyway.
- We would have to document the JSON API.
- People really dislike having to learn more APIs, and usually just divine the API from the output.
- To quote Raymond Hettinger: There must be a better way!
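A minimal sketch of the "bridge" idea described above, assuming a hypothetical internal dictionary. The internal key names `_bad` and `_files_read_err` and the helper `to_public` are invented for illustration and are not taken from the reuse codebase. The point is only that a fixed mapping keeps the public keys stable even if the internal names change:

```python
import json

# Hypothetical internal representation; these key names are illustrative only.
internal_report = {
    "_bad": {"InVaLiD": ["LICENSES/InVaLiD.txt"]},
    "_files_read_err": ["README.md"],
}

# Fixed mapping from internal keys to the public, documented keys.
PUBLIC_KEYS = {
    "_bad": "bad_licenses",
    "_files_read_err": "read_errors",
}


def to_public(report):
    """Translate the internal dictionary into the stable public layout."""
    return {public: report.get(internal) for internal, public in PUBLIC_KEYS.items()}


public_json = json.dumps(to_public(internal_report), indent=2, sort_keys=True)
print(public_json)
```

Every internal rename would then require touching `PUBLIC_KEYS`, which is exactly the maintenance burden the list above complains about.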
So I recently learnt that SPDX soon wants to adopt JSON and YAML as new output formats. Or rather, they want to formalise those formats. @silverhook brought up that YAML is likely to become dominant among the output formats. I would personally prefer JSON over YAML because JSON is in the Python standard library, and is imo a much cleaner, clearer standard, but I can concede that YAML is a lot prettier to look at.
So instead of doing our own JSON API, it would be much nicer to simply output SPDX JSON/YAML. If the SPDX JSON/YAML doesn't contain some stuff that we need, we can:
- add custom fields;
- add a separate section/key at the bottom of the document;
- add `reuse json` after all with a standalone API.
The way I see it, this could be implemented really well through https://github.com/spdx/tools-python as a library. We could contribute to that project to add some of the things we need, and remove some code in this repository in the process. As a plus, if we use that repository (and patch it up), we get multiple output formats for free. It'd be a matter of `reuse spdx --json` and `reuse spdx --yaml`.
The only problem, however, is that I am not entirely happy with tools-python. It's not documented very well, and it has some silly things going on. I'd tally them, but I'll withhold for now.
I'm fairly confident, however, that tighter integration with an improved tools-python would improve this tool a lot.
Interesting thoughts! One way to bring this issue to SPDX's attention could be to write an email to the spdx-tech mailing list (https://lists.spdx.org/g/spdx-tech) and perhaps raise it as a topic in one of their calls.
My thoughts are that YAML should deprecate tag:value, but I think JSON will be the more popular choice in SPDX.
It is already happening, see: https://github.com/spdx/spdx-spec/blob/7ca9672d1abf2b60985b840d5705a6539ee36635/chapters/1-rationale.md#17-format-requirements-
In any case, I agree that joining the SPDX Technical mailing list and calls would make sense for at least one of you, if not both.
I've been subscribed to the lists for a long time, but very passive. I'll be able to be more proactive in the coming months; see if I can attend some meetings.
The new SPDX spec should include JSON and YAML by now :)
I'd like to have more support for other SPDX formats as well. RDF and YAML are nice, but JSON is something I can directly inject into Elasticsearch and generate dashboards for a project overview. Now that it is part of the SPDX spec it feels like the right moment to adopt this.
JSON spec: https://github.com/spdx/spdx-spec/blob/master/schemas/spdx-schema.json
I'd have to look into the source code, but I assume the internal data structure of `reuse spdx` might have to be modified, though it is mainly about serializing into JSON.
As the JSON spec contains the spdxVersion field, I don't think the REUSE tool needs to handle multiple SPDX versions necessarily. It just reports the version it outputs and it is up to the client to handle it properly. And users can always explicitly install an older version of REUSE. Still, being able to output different versions in the future might be a nice-to-have.
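A minimal sketch of what "it is up to the client to handle it properly" could look like on the consuming side. The set of supported versions and the helper name `check_spdx_version` are assumptions made for this example only:

```python
import json

# Versions this hypothetical client knows how to parse (an assumption).
SUPPORTED = {"SPDX-2.2", "SPDX-2.3"}


def check_spdx_version(document_text):
    """Return the document's spdxVersion, or raise if the client cannot handle it."""
    document = json.loads(document_text)
    version = document.get("spdxVersion")
    if version not in SUPPORTED:
        raise ValueError(f"unsupported SPDX version: {version!r}")
    return version


# A stripped-down document carrying only the fields needed for the check.
sample = json.dumps({"spdxVersion": "SPDX-2.2", "SPDXID": "SPDXRef-DOCUMENT"})
print(check_spdx_version(sample))
```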
In today's maintainers call, we agreed that we want to have this in the next release.
What we agreed on:
- The focus shall be on `lint` first. Perhaps in the future `supported-licenses`, and `spdx` will be covered in #394.
- The JSON output should carry a version number in case we change the format in the future.
These questions remain:
- Shall the output work via a flag like `--json`, or shall it be a separate subcommand like `reuse json`? (Personal note: flag makes more sense IMHO)
- Do we also want to support YAML? (Personal note: I think we should focus on JSON first)
- How should it work together with linting a subset of the repository as mentioned in #512 in case this would become a separate subcommand?
We should also take #256 into consideration when working on this feature
Personally I like YAML, but since there are already official SPDX tools that are capable of transforming one valid SPDX format into another, let’s first concentrate on getting one right, and let that one be the one that will be the most useful for users.
We had another meeting in which we've discussed the remaining questions and also the relation to other topics linked with the lint subcommand.
The decisions we've made:
- The JSON output shall be toggleable with a `--json` flag to the `lint` subcommand.
- We also will introduce a `--verbose` flag, covered in #256. `--json` implies this flag, so JSON is always verbose.
- We are also tackling #512 under the `lint` subcommand. However, the JSON feature shall come first as this probably requires the most refactoring.
- YAML can come later optionally, but it's not a priority.
- The JSON output shall always show all keys, even error keys, which stay empty if no error occurs. See the suggestion for a format below.
- The plain and the JSON output shall be in parity wherever it makes sense. However, we figured that it's OK if the JSON output does not have three keys for files missing a) only license, b) only copyright, and c) both copyright and license info, but only two. That would mean files lacking both kinds of information would appear twice, but that shouldn't be an issue for machine-readable output.
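To illustrate the first two decisions, here is a standalone `argparse` sketch in which `--json` implies `--verbose`. It is not the reuse tool's actual argument handling; the helper names `build_parser` and `parse` are made up for this example:

```python
import argparse


def build_parser():
    """Build a toy parser with a `lint` subcommand and the two flags."""
    parser = argparse.ArgumentParser(prog="reuse")
    subparsers = parser.add_subparsers(dest="command")
    lint = subparsers.add_parser("lint")
    lint.add_argument("--json", action="store_true", help="output JSON")
    lint.add_argument("--verbose", action="store_true", help="detailed output")
    return parser


def parse(argv):
    """Parse argv and apply the rule that --json always implies --verbose."""
    args = build_parser().parse_args(argv)
    if getattr(args, "json", False):
        args.verbose = True
    return args


args = parse(["lint", "--json"])
print(args.json, args.verbose)
```

Applying the implication after parsing keeps the two flags independent in the parser itself, so `--verbose` still works on its own for the plain output.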
Suggestion of JSON outputs
Before starting to work on the code, we want to agree on a format for the JSON output. @carmenbianca also mentioned that we could draw inspiration from the ProjectReport class in report.py as this already contains an extensive dictionary. Here are my suggestions, without having a look at the current structure in the code though.
From the outputs I deleted all logger errors as I imagine these would be the same in both output types.
Example for compliant repo
I base this on the master branch of the example repo which is compliant.
Current output of reuse lint
# SUMMARY
* Bad licenses:
* Deprecated licenses:
* Licenses without file extension:
* Missing licenses:
* Unused licenses:
* Used licenses: CC-BY-4.0, CC0-1.0, GPL-3.0-or-later
* Read errors: 0
* Files with copyright information: 6 / 6
* Files with license information: 6 / 6
Congratulations! Your project is compliant with version 3.0 of the REUSE Specification :-)
Suggested output of reuse lint --json
{
"json_version": "1.0",
"reuse_version": "3.0",
"non_compliant": {
"missing_licenses": [],
"unused_licenses": [],
"deprecated_licenses": [],
"bad_licenses": [],
"licenses_without_extension": [],
"missing_copyright_info": [],
"missing_licensing_info": [],
"read_errors": []
},
"files": {
".gitignore": {
"copyright": {
"value": "SPDX-FileCopyrightText: 2019 Jane Doe <[email protected]>",
"source": ".gitignore"
},
"license": {
"value": "CC0-1.0",
"source": ".gitignore"
}
},
"Makefile": {
"copyright": {
"value": "SPDX-FileCopyrightText: 2019 Jane Doe <[email protected]>",
"source": "Makefile"
},
"license": {
"value": "GPL-3.0-or-later",
"source": "Makefile"
}
},
"README.md": {
"copyright": {
"value": "SPDX-FileCopyrightText: 2019 Jane Doe <[email protected]>",
"source": "README.md"
},
"license": {
"value": "GPL-3.0-or-later",
"source": "README.md"
}
},
"img/cat.jpg": {
"copyright": {
"value": "SPDX-FileCopyrightText: 2017 Peter Janzen",
"source": "img/cat.jpg.license"
},
"license": {
"value": "CC-BY-SA-4.0",
"source": "img/cat.jpg.license"
}
},
"img/dog.jpg": {
"copyright": {
"value": "SPDX-FileCopyrightText: 2017 Raffael Herrmann",
"source": "img/dog.jpg.license"
},
"license": {
"value": "GPL-3.0-or-later",
"source": "img/dog.jpg.license"
}
},
"src/main.c": {
"copyright": {
"value": "SPDX-FileCopyrightText: 2019 Jane Doe <[email protected]>",
"source": "src/main.c"
},
"license": {
"value": "GPL-3.0-or-later",
"source": "src/main.c"
}
}
},
"summary": {
"used_licenses": [
"CC-BY-SA-4.0",
"CC0-1.0",
"GPL-3.0-or-later"
],
"files_total": 6,
"files_with_copyright_info": 6,
"files_with_licensing_info": 6,
"compliant": true
}
}
Example for many errors
Check out this branch to get the basis for the output below. Additionally, I made README.md unreadable to trigger one more error class.
Current output of reuse lint
# BAD LICENSES
'InVaLiD' found in:
* LICENSES/InVaLiD.txt
# DEPRECATED LICENSES
The following licenses are deprecated by SPDX:
* GPL-2.0+
# LICENSES WITHOUT FILE EXTENSION
The following licenses have no file extension:
* LICENSES/Apache-2.0
# MISSING LICENSES
'MIT' found in:
* Makefile
* img/cat.jpg
* img/dog.jpg
# UNUSED LICENSES
The following licenses are not used:
* Apache-2.0
* CC-BY-SA-4.0
* GPL-2.0+
* InVaLiD
# READ ERRORS
Could not read:
* README.md
# MISSING COPYRIGHT AND LICENSING INFORMATION
The following files have no copyright and licensing information:
* src/main.c
The following files have no copyright information:
* Makefile
The following files have no licensing information:
* .gitignore
# SUMMARY
* Bad licenses: InVaLiD
* Deprecated licenses: GPL-2.0+
* Licenses without file extension: Apache-2.0
* Missing licenses: MIT
* Unused licenses: Apache-2.0, CC-BY-SA-4.0, GPL-2.0+, InVaLiD
* Used licenses: MIT
* Read errors: 1
* Files with copyright information: 3 / 5
* Files with license information: 3 / 5
Unfortunately, your project is not compliant with version 3.0 of the REUSE Specification :-(
Suggested output for reuse lint --json
{
"json_version": "1.0",
"reuse_version": "3.0",
"non_compliant": {
"missing_licenses": [
"MIT"
],
"unused_licenses": [
"CC-BY-SA-4.0",
"GPL-2.0+",
"InVaLiD",
"Apache-2.0"
],
"deprecated_licenses": [
"GPL-2.0+"
],
"bad_licenses": [
"InVaLiD"
],
"licenses_without_extension": [
"Apache-2.0"
],
"missing_copyright_info": [
"Makefile",
"src/main.c"
],
"missing_licensing_info": [
".gitignore",
"src/main.c"
],
"read_errors": [
"README.md"
]
},
"files": {
".gitignore": {
"copyright": {
"value": "Copyright (c) 2012 Myself",
"source": ".gitignore"
},
"license": {
"value": null,
"source": null
}
},
"Makefile": {
"copyright": {
"value": null,
"source": null
},
"license": {
"value": "MIT",
"source": "Makefile"
}
},
"img/cat.jpg": {
"copyright": {
"value": "SPDX-FileCopyrightText: 2022 me",
"source": "img/cat.jpg.license"
},
"license": {
"value": "MIT",
"source": "img/cat.jpg.license"
}
},
"img/dog.jpg": {
"copyright": {
"value": "2022 me",
"source": ".reuse/dep5"
},
"license": {
"value": "MIT",
"source": ".reuse/dep5"
}
},
"src/main.c": {
"copyright": {
"value": null,
"source": null
},
"license": {
"value": null,
"source": null
}
}
},
"summary": {
"used_licenses": [
"MIT"
],
"files_total": 5,
"files_with_copyright_info": 3,
"files_with_licensing_info": 3,
"compliant": false
}
}
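Since the JSON output only has two keys for missing information, a consumer who wants the three categories of the plain output can recover them with set operations. A small sketch using the two lists from the example above (the variable names are arbitrary):

```python
# The two lists as they appear under "non_compliant" in the example above.
non_compliant = {
    "missing_copyright_info": ["Makefile", "src/main.c"],
    "missing_licensing_info": [".gitignore", "src/main.c"],
}

no_copyright = set(non_compliant["missing_copyright_info"])
no_license = set(non_compliant["missing_licensing_info"])

# Files that appear in both lists lack both kinds of information.
missing_both = sorted(no_copyright & no_license)
missing_only_copyright = sorted(no_copyright - no_license)
missing_only_license = sorted(no_license - no_copyright)

print(missing_both)            # ['src/main.c']
print(missing_only_copyright)  # ['Makefile']
print(missing_only_license)    # ['.gitignore']
```

This reproduces the three groups listed under "MISSING COPYRIGHT AND LICENSING INFORMATION" in the plain output.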
Open questions/comments
- Not covered in my examples are files that contain multiple copyright holders or licenses. So perhaps we'd have to make the content of `.files.*.copyright.value` an array instead of a string, even if it only contains one element.
- As the `source` key in `.files` I put the actual file name or the path to the .license or DEP5 file. We could also make the possible values `file`, `dotlicense` or `dep5`, but I figured that the full path makes more sense for debugging.
- In the suggested JSON outputs, I do not duplicate the `read_errors` key or the `unused_licenses` key in the summary, like the plain output does. I think that's fine this way; the summary should show info that's otherwise not available. Would you agree? Or shall we show a counter of these occurrences? Right now you'd have to count the elements in the arrays for this.
- Unreadable files are not added to `files_total` or to the `files` key. This is on par with the plain output, but I wanted to mention this.
- The `files` key also shows the `SPDX-FileCopyrightText` tag in the `value` key. As per #536 we are considering removing that from the `spdx` subcommand, so we may also want to do it here.
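On the counter question: counting the array elements on the consumer side is short. A sketch using the `non_compliant` section of the many-errors example (only the dictionary comprehension is the actual technique; the rest is sample data copied from above):

```python
# "non_compliant" section copied from the many-errors example.
non_compliant = {
    "missing_licenses": ["MIT"],
    "unused_licenses": ["CC-BY-SA-4.0", "GPL-2.0+", "InVaLiD", "Apache-2.0"],
    "deprecated_licenses": ["GPL-2.0+"],
    "bad_licenses": ["InVaLiD"],
    "licenses_without_extension": ["Apache-2.0"],
    "missing_copyright_info": ["Makefile", "src/main.c"],
    "missing_licensing_info": [".gitignore", "src/main.c"],
    "read_errors": ["README.md"],
}

# One counter per key; keys with empty arrays simply count as 0.
counts = {key: len(items) for key, items in non_compliant.items()}

print(counts["read_errors"])      # 1
print(counts["unused_licenses"])  # 4
```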
Great work!
Covering multiple copyright values was my first thought reading this:
".gitignore": {
"copyrights": [
{
"value": "SPDX-FileCopyrightText: 2019 Jane Doe <[email protected]>",
"source": ".gitignore"
},
{
"value": "SPDX-FileCopyrightText: 2012 John Doe <[email protected]>",
"source": ".gitignore"
}
],
"license": {
"value": "CC0-1.0",
"source": ".gitignore"
}
},
And maybe the same for licenses as well?
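If both were lists, converting an entry from the single-object layout could look like the sketch below. The plural key `licenses` and the helper `to_lists` are hypothetical names, not part of any agreed format. Entries without information become empty lists, in line with the decision to always show all keys:

```python
def to_lists(entry):
    """Wrap the single copyright/license objects of a file entry in lists.

    Entries whose value is None (no information found) become empty lists.
    """
    def wrap(item):
        return [item] if item["value"] is not None else []

    return {
        "copyrights": wrap(entry["copyright"]),
        "licenses": wrap(entry["license"]),
    }


# The Makefile entry from the many-errors example: no copyright, MIT license.
makefile = {
    "copyright": {"value": None, "source": None},
    "license": {"value": "MIT", "source": "Makefile"},
}
print(to_lists(makefile))
```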