license-list-XML
Formatting of plain license text in JSON data is broken
Moving issue from SPDX tools. Originally submitted by @sschuberth
Taking Apache-2.0 as an example: when extracting the licenseText string to a file, I'd expect that file to be formatted exactly like the original plain-text license, including leading spaces and blank lines. However, the JSON string is formatted like this:
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
(note the missing leading spaces but added trailing spaces) which not only does not match the original text but also is quite ugly.
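For reference, the extraction step described above can be sketched in a few lines of Python. This is a minimal sketch, assuming the per-license JSON file is a single object with a top-level `licenseText` string (the field named in this issue); the function name and the file paths passed to it are hypothetical.

```python
import json
from pathlib import Path


def extract_license_text(json_path, out_path):
    """Write the licenseText field of a license JSON file to out_path.

    Assumes the JSON document is an object with a top-level "licenseText"
    string; the function name and arguments are illustrative only.
    """
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    text = data["licenseText"]
    # newline="" disables newline translation, so the string is written
    # unchanged and any formatting difference comes from the data itself.
    with open(out_path, "w", encoding="utf-8", newline="") as handle:
        handle.write(text)
    return text
```

Writing the string untouched makes it easy to see that missing leading spaces or added trailing spaces originate in the JSON data rather than in the extraction.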
The way we are maintaining the license information in the license-list-XML github repository it is not feasible to retain the formatting of the original text since XML removes the white space and we do not have enough tags to retain all of the formatting.
That being said, we could do a better job of formatting the text and making it look prettier.
The code for this has actually moved to a different project: LicenseListPublish
The way we are maintaining the license information in the license-list-XML github repository it is not feasible to retain the formatting of the original text
I believe that's exactly the problem then, and XML shouldn't be used as the primary format to store the original text. It could still be used as a format to apply SPDX-specific formatting, however.
+1 to maintaining the original format of the licenses. Currently, for example, these two license texts differ by newlines. https://github.com/spdx/tools/blob/master/resources/stdlicenses/MIT.jsonld https://github.com/OpenSourceOrg/licenses/blob/master/texts/plain/MIT
While it is not a huge deal for consumers to find some wordwrapping implementation and run the text through before, say, generating a NOTICE file, it is extra hassle and will lead to apparent differences. Would be great to generate clarity and simplicity around licenses by using the same canonical form everywhere.
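As an illustration of the extra step mentioned above, a consumer could re-wrap the text with Python's standard `textwrap` module before generating a NOTICE file. This is only a sketch: the 78-column default and the function name are assumptions, and re-wrapping cannot recover the upstream layout (indentation, centered headings), which is why a canonical form would still be preferable.

```python
import textwrap


def rewrap(text, width=78):
    """Re-wrap each blank-line-separated paragraph to at most `width` columns.

    Long tokens such as URLs are kept intact, so a line can exceed `width`
    only when a single word is longer than the limit.
    """
    wrapped = []
    for paragraph in text.split("\n\n"):
        # Collapse the paragraph's internal whitespace, then re-fill it.
        flat = " ".join(paragraph.split())
        wrapped.append(
            textwrap.fill(
                flat,
                width=width,
                break_long_words=False,
                break_on_hyphens=False,
            )
        )
    return "\n\n".join(wrapped)
```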
Resolved in PR spdx/LicenseListPublisher#83
I'm reopening this to remind myself that the issue hasn't really been fixed yet. While PR spdx/LicenseListPublisher#83 laid the foundation for getting it fixed, https://raw.githubusercontent.com/spdx/license-list-data/b8d6af45ad2fcfed61bb85a8ad068aa4a77eadf9/text/Apache-2.0.txt still does not match https://www.apache.org/licenses/LICENSE-2.0.txt formatting-wise.
IIUC @goneall correctly, the remaining thing to do is to commit the original / upstream plain text licenses to https://github.com/spdx/license-list-XML/tree/master/test/simpleTestForGenerator and then rerun this publisher to make the correct licenses show up at https://github.com/spdx/license-list-data/tree/master/text. I'll try to write a script for that to finally resolve this long-standing issue.
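A check along the following lines could confirm whether a generated text file matches its upstream counterpart once the files are updated. It is a minimal sketch using Python's standard `difflib`; the function names and the "upstream" / "generated" labels are hypothetical, and both texts are assumed to be already loaded as strings (for example, from local copies of the two files linked above).

```python
import difflib


def formatting_diff(generated, upstream):
    """Return unified-diff lines between two license texts.

    The list is empty when the texts are identical.
    """
    return list(
        difflib.unified_diff(
            upstream.splitlines(keepends=True),
            generated.splitlines(keepends=True),
            fromfile="upstream",
            tofile="generated",
        )
    )


def differs_only_in_whitespace(generated, upstream):
    """True if the texts differ but are equal once whitespace runs are collapsed."""
    return generated != upstream and generated.split() == upstream.split()
```

The second helper separates pure formatting drift (spaces, blank lines, line breaks) from differences in the actual wording.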
@sschuberth - Just going through the older issues. Any thoughts or progress on updating the text in the license-list-XML repo?
Sorry @goneall, this issue has slipped my mind. But would you agree that the mentioned approach is the way to go:
the remaining thing to do is to commit the original / upstream plain text licenses to https://github.com/spdx/license-list-XML/tree/master/test/simpleTestForGenerator and then rerun this publisher to make the correct licenses show up at https://github.com/spdx/license-list-data/tree/master/text.
@sschuberth I agree with the above approach.
I'll move this issue over to the license-list-XML repo since this is where the work will be done.
@swinslow @jlovejoy FYI - if you disagree with updating the test text to fix the formatting in JSON, please add to this issue and cc @sschuberth
@sschuberth @goneall - I'm not sure I'm following the implementation details here, but I think the goal is to get to a point where the text files at https://github.com/spdx/license-list-XML/tree/main/test/simpleTestForGenerator are "formatted" to look like or reflect any original text file for a given license (e.g., https://www.apache.org/licenses/LICENSE-2.0.txt ) or at least have some form of line length limit to avoid horizontal scrolling?
if we do that, then the formatting will show up better at https://github.com/spdx/license-list-data/tree/master/text.
is that right-ish?
I'm all in favor of better formatting such that people can "reuse" text files. I think we need to document which text file directory is the best to use as well.
Also, keep in mind that the text files created in https://github.com/spdx/license-list-XML/tree/main/test/simpleTestForGenerator are created as part of the PR when the license is accepted to the SPDX License List. We have a GSoC project that would add functionality to create this text file automatically via the online submission tool, instead of people having to create it manually. So, any formatting parameters should be included for that project.
@jlovejoy
if we do that, then the formatting will show up better at https://github.com/spdx/license-list-data/tree/master/text.
Close - the specific issue is related to the JSON files, but the formatting for the JSON and the text files comes from the same source.
Sounds like you're in general agreement
We'd like to close this out. @sschuberth - the issue you are having really only occurs with a small number of licenses (the vast majority of licenses on the SPDX license list do not have an official, plain text, steward-published origin). That being said, can you list the licenses to which this would apply, e.g. Apache-2.0 and which others? I suspect it's a short list.
Keep in mind that even for the GNU licenses, which do have the above, the FSF has changed minor things (like the address) over time, which SPDX does not consider a different license under our matching guidelines. So, not sure how we would decide which address variant is the "canonical" text file we should use?
tagging @swinslow as he's offered to update the files, if/when you give a list here! :)
So, not sure how we would decide which address variant is the "canonical" text file we should use?
I see, makes sense. Thanks for your comment anyway!