Canonical license texts?

Open zvr opened this issue 3 years ago • 13 comments

Do we want to keep a "canonical" text for licenses that have one?

The question has been raised in the past, and is currently triggering #1396.

I would be in favor of having another directory with canonical texts, where they exist. I would not be in favor of using the test files for this purpose.

What do others think?

zvr avatar Apr 28 '22 20:04 zvr

Also see the related discussion at https://github.com/spdx/LicenseListPublisher/issues/30#issuecomment-457012678.

sschuberth avatar Apr 28 '22 21:04 sschuberth

This has also come up in the context of FSFE's REUSE tooling, and their desire to use the test files as source content for license texts. There's a mildly-related discussion in the long thread at https://lists.spdx.org/g/Spdx-legal/topic/88334638#3095

My two cents:

  • I don't personally think the test files should be used for this purpose
  • If other people want to use the test files for this purpose anyway, that's fine, they're more than welcome to
  • I could see some possible value in having a list somewhere of "canonical" texts, in the small subset where that's actually possible (note that I do think this is a small number)
  • Anyone who does want to do that, though, should be very conscious of @jlovejoy's comment here noting that even the most common licenses have had their "canonical" text changed by the upstream license stewards from time to time
  • If there is going to be a collection of "canonical" texts maintained by SPDX, I don't know that it should be in license-list-XML. This repo is really about being the upstream input to the License List website and license-list-data repo, and the use case for "canonical" texts isn't really part of that.

swinslow avatar Apr 28 '22 21:04 swinslow

@swinslow to your last point: people should consume them from https://github.com/spdx/license-list-data. However, the current setup is that everything in that repo is generated based on this one.

We can obviously create another repo (e.g., spdx/canonical-texts) instead of a directory in this repo; this might be cleaner.

zvr avatar Apr 28 '22 21:04 zvr

Reference https://github.com/spdx/license-list-XML/pull/1396#issuecomment-1112709802

Perhaps we should have a real-time discussion on this to finally decide what the solution is? I can support any reasonable solution with changes in the LicenseListPublisher.

goneall avatar Apr 28 '22 22:04 goneall

> We can obviously create another repo (e.g., spdx/canonical-texts) instead of a directory in this repo; this might be cleaner.

I tend to agree with that. Plus, that separate repo could run a GitHub action that continuously crawls the locations of the canonical license texts and auto-commits them, so we'd get the diffs if an upstream license text ever changes.
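
As a rough illustration of the change-detection step such an action would need, here is a minimal Python sketch. It assumes the upstream text has already been fetched by some other step; the function names `sha256_of` and `upstream_diff` and the `stored/` and `upstream/` labels are hypothetical, not part of any existing SPDX tooling.

```python
import difflib
import hashlib
from typing import Optional


def sha256_of(text: str) -> str:
    """Return the hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upstream_diff(stored: str, fetched: str, name: str = "LICENSE") -> Optional[str]:
    """Return a unified diff if the fetched text differs from the stored copy.

    Returns None when the two texts are identical, so a caller
    (hypothetically, a scheduled job) can skip the commit.
    """
    if sha256_of(stored) == sha256_of(fetched):
        return None
    diff = difflib.unified_diff(
        stored.splitlines(keepends=True),
        fetched.splitlines(keepends=True),
        fromfile=f"stored/{name}",    # hypothetical label for the committed copy
        tofile=f"upstream/{name}",    # hypothetical label for the crawled copy
    )
    return "".join(diff)
```

A scheduled job could call `upstream_diff` once per tracked license and commit only when it returns a diff. The comparison here is exact, so even a whitespace-only change upstream would show up.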

sschuberth avatar Apr 29 '22 06:04 sschuberth

Just to level-set, what are the criteria that you're picturing would be used to determine whether a "canonical" text version exists for a given license?

I'd assume something like all of the following, if the intention is to claim that this is a byte-for-byte "canonical" version of the license:

  • there is a universally-acknowledged license steward for that license
  • the steward has published exactly one version of that license
  • the steward has published it in plain-text format, in a standalone file with no other content
  • the license does not include any "templating" or "replaceable text" / "fill in your copyright notice here"

There might be other criteria, but that's what comes to mind offhand.
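
To make the four criteria above concrete, they could be written as a single check. This is only a sketch: the `LicenseCandidate` type, its field names, and `has_canonical_text` are hypothetical, and filling in the fields for a real license would still require human judgment.

```python
from dataclasses import dataclass


@dataclass
class LicenseCandidate:
    """Hypothetical record describing how a license is published upstream."""
    has_acknowledged_steward: bool      # a universally acknowledged steward exists
    steward_published_versions: int     # distinct texts the steward has published
    plain_text_standalone: bool         # plain-text file with no other content
    has_replaceable_text: bool          # templating / "fill in your name here"


def has_canonical_text(c: LicenseCandidate) -> bool:
    """Apply all four criteria; every one must hold."""
    return (
        c.has_acknowledged_steward
        and c.steward_published_versions == 1
        and c.plain_text_standalone
        and not c.has_replaceable_text
    )


# Two made-up candidates, not real licenses.
strict_ok = LicenseCandidate(True, 1, True, False)
templated = LicenseCandidate(True, 1, True, True)
print(has_canonical_text(strict_ok), has_canonical_text(templated))
```

Written this way, it is easy to see how restrictive the conjunction is: failing any single criterion excludes the license.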

If so, do we have a guess at what percentage of the License List would actually fall into this category? Skimming through the list and making some assumptions, I'd guess maybe the CC licenses, probably some or all of the GNU licenses (though I know GNU has changed their content from time to time), some of the others here and there. I'd guess it's significantly less than a majority of what's on the License List.

I would not be in favor of putting anything inside a "canonical texts" repo that isn't official according to the accepted upstream steward for that license. For example, for the MIT license, MIT is not actually the steward and there's lots of replaceable text, so I assume nothing would be included in the "canonical texts" repo for it. I suspect there are a lot of similar, widely-used licenses that would fall into that category.

swinslow avatar Apr 29 '22 10:04 swinslow

> there is a universally-acknowledged license steward for that license

If "steward" here is not limited to a person, but could also be an organization / foundation, I'd agree.

> the steward has published exactly one version of that license

That depends on what you count as a "version". E.g. Apache (formally) has versions 1.1 and 2.0, so that's (at least) two versions "of that license".

Also, do you count different file formats of the same text as different versions? To me, "canonical" is specific to the file format. For example, there could be a canonical plain-text version, a canonical PDF version, etc. of a specific license.

> the steward has published it in plain-text format, in a standalone file with no other content

I basically agree, but since "canonical" is a file-format-specific thing to me, it's not necessarily limited to plain text.

> the license does not include any "templating" or "replaceable text" / "fill in your copyright notice here"

That would not be a criterion for me. E.g. https://www.apache.org/licenses/LICENSE-2.0.txt does contain an appendix about how the license should be applied (incl. placeholders), but I still regard it as the canonical license.

sschuberth avatar Apr 29 '22 10:04 sschuberth

Is the proposal to:

  • A) maintain a separate repo which consumers would access directly
  • B) maintain a separate repo which would be the source data for canonical texts, which would be copied into the license-list-data repo and would also be available through the APIs
  • C) all of the above

goneall avatar Apr 29 '22 12:04 goneall

Just FYI - one of the recent GSoC projects implemented a license text scraper in the LicenseListPublisher for the purpose of verifying the license URLs. Some of that code could be leveraged for this purpose.

The code can be found here: https://github.com/spdx/LicenseListPublisher/tree/master/src/org/spdx/crossref

goneall avatar Apr 29 '22 13:04 goneall

How can you have a BSD-canonical license? There's no license steward, and the text has a huge number of variations (even when you omit the ones that talk about the voices in Bill Paul's head). The 'original' isn't templated at all, and it uses terms specific to a tape distribution of a known version, which fit less well with the continuous releases that all open source projects with internet-facing SCMs do. At best we can have an idealized license, constructed after the fact, for this class of licenses. And it's a large and important class, not some obscure backwater of open source.

I'd love to have this, as it makes my job of having files that contain only the SPDX License Expression to indirectly refer to the license a lot easier to explain in our policy documents (which is required, imho, to create the legal contract (or whatever the right word is for a one-sided grant) by making it clear what that license grant is).

The nuts and bolts of having it in a separate repo, APIs to access it, etc. are interesting. I rather like that too, but I'm stumbling on 'canonical' to describe it. The best we can get is more of a 'specimen': a license that is as representative and as generic as possible. That would certainly be more than adequate to drive whatever testing use case prompted this request.

bsdimp avatar Apr 29 '22 13:04 bsdimp

As a gut instinct, I feel strongly against a new, separate repo for this. That is another thing to maintain, and therefore to have criteria around, etc., and for the reasons already stated it is going to be more challenging than it seems.

Based on previous discussions, it seems like we got to a point of 1) recommending against using the text files in this repo for this purpose; and 2) pointing people to something either a) already in the license-list-data repo; or b) something to-be-created in the license-list-data repo.

I'd strongly recommend we pick up there and, as @goneall suggests, perhaps try out using some iteration of what has been discussed recently in terms of identifying some key aspects in terms of: 1) what is the issue to be solved; 2) how does it fit with the SPDX mission/vision; and 3) is this something we should/have time/will solve (and then if so, how) is solving and

jlovejoy avatar Jun 08 '22 20:06 jlovejoy

also discussed at #1575

jlovejoy avatar Aug 22 '22 02:08 jlovejoy