Formatting of plain license text in JSON data is broken

Open goneall opened this issue 7 years ago • 10 comments

Moving issue from SPDX tools. Originally submitted by @sschuberth

Taking Apache-2.0 as an example: when extracting the licenseText string to a file, I'd expect that file to be formatted exactly like the original plain-text license, including leading spaces and blank lines. However, the JSON string is formatted like

Apache License

Version 2.0, January 2004

http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      

      "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

      

      "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

      

(note the missing leading spaces but added trailing spaces) which not only does not match the original text but also is quite ugly.
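A minimal sketch of how the symptom above could be checked, assuming the JSON document exposes the text under a `licenseText` key as described. The helper names and the shortened sample document are hypothetical and only for illustration; this is not the SPDX tooling itself.

```python
import json
from typing import List


def extract_license_text(json_str: str) -> str:
    """Return the licenseText string from a license JSON document."""
    return json.loads(json_str)["licenseText"]


def lines_with_trailing_whitespace(text: str) -> List[int]:
    """Return the 1-based numbers of lines that end in spaces or tabs."""
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if line != line.rstrip(" \t")
    ]


# Hypothetical, heavily shortened stand-in for the Apache-2.0 JSON data.
sample_json = json.dumps(
    {
        "licenseText": (
            "Apache License\n\n"
            "   1. Definitions.\n\n"
            "      \n\n"
            '      "License" shall mean the terms and conditions ...'
        )
    }
)

text = extract_license_text(sample_json)
print(lines_with_trailing_whitespace(text))
```

Writing `text` to a file and diffing it against the upstream plain-text license would show the remaining differences in indentation and blank lines.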

goneall avatar Jun 11 '18 16:06 goneall

The way we are maintaining the license information in the license-list-XML github repository it is not feasible to retain the formatting of the original text since XML removes the white space and we do not have enough tags to retain all of the formatting.

That being said, we could do a better job of formatting the text and making it look prettier.

The code for this has actually moved to a different project: LicenseListPublisher

goneall avatar Jun 11 '18 16:06 goneall

The way we are maintaining the license information in the license-list-XML github repository it is not feasible to retain the formatting of the original text

I believe that's exactly the problem then, and XML shouldn't be used as the primary format to store the original text. It could still be used as a format to apply SPDX-specific formatting, however.

sschuberth avatar Jun 11 '18 18:06 sschuberth

+1 to maintaining the original format of the licenses. Currently, for example, these two license texts differ by newlines. https://github.com/spdx/tools/blob/master/resources/stdlicenses/MIT.jsonld https://github.com/OpenSourceOrg/licenses/blob/master/texts/plain/MIT

While it is not a huge deal for consumers to find some wordwrapping implementation and run the text through before, say, generating a NOTICE file, it is extra hassle and will lead to apparent differences. Would be great to generate clarity and simplicity around licenses by using the same canonical form everywhere.
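As a rough illustration of the extra step mentioned here, a consumer could re-wrap the text with Python's standard `textwrap` module before writing a NOTICE file. The function name and the 80-column default are assumptions for this sketch, not anything SPDX prescribes.

```python
import textwrap


def wrap_paragraphs(text: str, width: int = 80) -> str:
    """Re-wrap each blank-line-separated paragraph to at most `width` columns.

    Whitespace inside a paragraph is collapsed first, so any original
    indentation is lost. That loss is exactly the kind of apparent
    difference a shared canonical form would avoid.
    """
    paragraphs = text.split("\n\n")
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped)


if __name__ == "__main__":
    demo = (
        "Permission is hereby granted, free of charge, to any person "
        "obtaining a copy of this software ...\n\n"
        'THE SOFTWARE IS PROVIDED "AS IS" ...'
    )
    print(wrap_paragraphs(demo, width=40))
```

Two consumers that pick different widths or wrapping rules end up with different files for the same license, which is the hassle described above.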

jeffmcaffer avatar Jan 23 '19 23:01 jeffmcaffer

Resolved in PR spdx/LicenseListPublisher#83

goneall avatar Nov 14 '20 04:11 goneall

I'm reopening this to remind myself that the issue hasn't really been fixed yet. While PR spdx/LicenseListPublisher#83 laid the foundation for getting it fixed, https://raw.githubusercontent.com/spdx/license-list-data/b8d6af45ad2fcfed61bb85a8ad068aa4a77eadf9/text/Apache-2.0.txt still does not match https://www.apache.org/licenses/LICENSE-2.0.txt formatting-wise.

IIUC @goneall correctly, the remaining thing to do is to commit the original / upstream plain text licenses to https://github.com/spdx/license-list-XML/tree/master/test/simpleTestForGenerator and then rerun this publisher to make the correct licenses show up at https://github.com/spdx/license-list-data/tree/master/text. I'll try to write a script for that to finally resolve this long-standing issue.
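One way to tell "differs only formatting-wise" apart from "differs in wording" is sketched below. The function and its three labels are hypothetical; it is neither the script mentioned above nor part of the SPDX matching guidelines.

```python
def compare_texts(published: str, upstream: str) -> str:
    """Classify how a published license text differs from the upstream one."""
    if published == upstream:
        return "identical"
    # str.split() with no arguments splits on any run of whitespace, so
    # equal token lists mean the texts differ only in spacing or line breaks.
    if published.split() == upstream.split():
        return "whitespace-only"
    return "content differs"


if __name__ == "__main__":
    upstream = "                                 Apache License\n                           Version 2.0, January 2004\n"
    published = "Apache License\n\nVersion 2.0, January 2004\n"
    print(compare_texts(published, upstream))
```

A "whitespace-only" result is the case this issue is about: the words match, but the file is not a verbatim copy of the upstream text.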

sschuberth avatar Jan 09 '22 15:01 sschuberth

@sschuberth - Just going through the older issues. Any thoughts or progress on updating the text in the license-list-XML repo?

goneall avatar Apr 10 '23 00:04 goneall

Sorry @goneall, this issue has slipped my mind. But would you agree that the mentioned approach is the way to go:

the remaining thing to do is to commit the original / upstream plain text licenses to https://github.com/spdx/license-list-XML/tree/master/test/simpleTestForGenerator and then rerun this publisher to make the correct licenses show up at https://github.com/spdx/license-list-data/tree/master/text.

sschuberth avatar Apr 10 '23 11:04 sschuberth

@sschuberth I agree with the above approach.

I'll move this issue over to the license-list-XML repo since this is where the work will be done.

@swinslow @jlovejoy FYI - if you disagree with updating the test text to fix the formatting in JSON, please add to this issue and cc @sschuberth

goneall avatar Apr 10 '23 16:04 goneall

@sschuberth @goneall - I'm not sure I'm following the implementation details here, but I think the goal is to get to a point where the text files at https://github.com/spdx/license-list-XML/tree/main/test/simpleTestForGenerator are "formatted" to look like or reflect any original text file for a given license (e.g., https://www.apache.org/licenses/LICENSE-2.0.txt ) or at least have some form of line length limit to avoid horizontal scrolling?

if we do that, then the formatting will show up better at https://github.com/spdx/license-list-data/tree/master/text.

is that right-ish?

I'm all in favor of better formatting such that people can "reuse" text files. I think we need to document which text file directory is the best to use as well.

Also, keep in mind that the text files created in https://github.com/spdx/license-list-XML/tree/main/test/simpleTestForGenerator are created as part of the PR when the license is accepted to the SPDX License List. We have a GSoC project that would add functionality to create this text file automatically via the online submission tool, instead of people having to create it manually. So, any formatting parameters should be included for that project.

jlovejoy avatar Apr 11 '23 23:04 jlovejoy

@jlovejoy

if we do that, then the formatting will show up better at https://github.com/spdx/license-list-data/tree/master/text.

Close - the specific issue is related to the JSON files, but the formatting for the JSON and the text files comes from the same source

Sounds like you're in general agreement

goneall avatar Apr 12 '23 05:04 goneall

We'd like to close this out. @sschuberth - the issue you are having really only occurs with a small number of licenses (the vast majority of licenses on the SPDX license list do not have an official, plain text, steward-published origin). That being said, can you list the licenses for which this would apply, e.g. Apache-2.0 and which others, as I suspect it's a short list.

Keep in mind, that even for the GNU licenses which do have the above, the FSF has changed minor things (like the address) over time, which SPDX does not consider a different license under our matching guidelines. So, not sure how we would decide which address variant is the "canonical" text file we should use?

tagging @swinslow as he's offered to update the files, if/when you give a list here! :)

jlovejoy avatar Aug 15 '25 18:08 jlovejoy

So, not sure how we would decide which address variant is the "canonical" text file we should use?

I see, makes sense. Thanks for your comment anyway!

sschuberth avatar Aug 25 '25 13:08 sschuberth