Blacklisting for download licenses

Open ppalaga opened this issue 5 years ago • 8 comments

Mentioned by @srdo in https://github.com/mojohaus/license-maven-plugin/issues/313

Let's clear the following questions:

  1. Should this be a separate Mojo taking the licenses.xml file as an input?

  2. What should be the blacklisting criteria? I guess intuitively, we want to reject some specific licenses, such as GPL in an EPL project. But the question is, how can we reliably identify a licenses.xml entry as GPL?

We have license name, URL and locally stored license text.

License name: It comes from Maven metadata and ranges from completely bogus, through mistyped, to correct but with many variants.

URLs: In an ideal world, licenses map to URLs 1:n. But we can never know all URLs of a specific license.

License text: Perhaps the most promising option. Would regexing the text be enough?

ppalaga avatar Apr 18 '19 08:04 ppalaga

I am wondering if we could get away with only supporting SPDX identifiers for this? The way I see it, a blacklisting mechanism needs to be reliable, as it is most likely being used to fail the build if a specific license (e.g. GPL v2.0) makes it into the dependency tree. I don't think it is very useful to e.g. ban GPL v2.0 if an artifact with that license may make it into my tree anyway, because someone called it "GPL v2" instead.

I don't see much use in a blacklisting feature that e.g. matches on license names directly, since as you mention, a single license will have lots of variants, and the same holds true of license URLs. I don't think regexing license text works either, as some licenses contain artifact-specific copyright information, e.g. the MIT license often has a copyright bit somewhere in it.

In a perfect world, everyone would move to SPDX identifiers in their POMs immediately, but until then, I think a way to make the blacklisting feature reliable would be to do the following:

  • The blacklisting is based on SPDX identifiers read from dependencies' licenses.license.name fields. The mojo errors if you use a non-SPDX identifier in the blacklist configuration.
  • We allow users to override licenses for artifacts on a per-artifact basis (maybe via the licensesConfigFile we already have?)
  • Any artifact the mojo encounters that doesn't match a known SPDX identifier will cause the build to fail.

The idea is that for artifacts using SPDX identifiers, the blacklist will just work. For anything else, the safest option is to make the user manually check the license, and manually set the right SPDX identifier in the licensesConfigFile. Hopefully SPDX identifiers will become more common, so the manual workload could be lessened over time.
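To make the three rules above concrete, here is a minimal sketch in Java. It is not code from the plugin: the class name, the method names and the tiny `KNOWN_SPDX_IDS` set are assumptions made up for illustration (a real implementation would load the full SPDX license list), and the `overrides` map merely stands in for whatever licensesConfigFile would provide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of an SPDX-identifier-based blacklist check. */
final class SpdxBlacklistSketch {

    // Assumption: a tiny stand-in for the full SPDX license list.
    static final Set<String> KNOWN_SPDX_IDS =
            Set.of("Apache-2.0", "MIT", "EPL-2.0", "BSD-3-Clause", "GPL-2.0-only");

    /** Rule 1: the blacklist itself may only contain SPDX identifiers. */
    static void validateBlacklist(Set<String> blacklist) {
        for (String id : blacklist) {
            if (!KNOWN_SPDX_IDS.contains(id)) {
                throw new IllegalArgumentException("Not an SPDX identifier: " + id);
            }
        }
    }

    /**
     * Returns one message per offending artifact, keyed by artifact.
     * declaredNames maps artifact to the license name from its POM;
     * overrides maps artifact to a manually assigned SPDX identifier (rule 2).
     */
    static Map<String, String> findProblems(Map<String, String> declaredNames,
                                            Map<String, String> overrides,
                                            Set<String> blacklist) {
        validateBlacklist(blacklist);
        Map<String, String> problems = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : declaredNames.entrySet()) {
            String artifact = entry.getKey();
            String license = overrides.getOrDefault(artifact, entry.getValue());
            if (license == null || !KNOWN_SPDX_IDS.contains(license)) {
                // Rule 3: anything that is not a known SPDX identifier is an error.
                problems.put(artifact, "unknown license: " + license);
            } else if (blacklist.contains(license)) {
                problems.put(artifact, "blacklisted license: " + license);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, String> declared = new LinkedHashMap<>();
        declared.put("org.example:good", "Apache-2.0");
        declared.put("org.example:sloppy", "GPL v2");
        declared.put("org.example:banned", "GPL-2.0-only");
        System.out.println(findProblems(declared, Map.of(), Set.of("GPL-2.0-only")));
    }
}
```

In this sketch an artifact declaring "GPL v2" is reported as unknown rather than silently passing, which is the point of the proposal: the user has to look at it and assign an identifier through an override.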

There may need to be an escape hatch too for dependencies that use some custom license that doesn't have an SPDX identifier, maybe let users define a custom set of extra identifiers in licensesConfigFile or elsewhere.

What do you think?

srdo avatar May 10 '19 18:05 srdo

In a perfect world, everyone would move to SPDX identifiers in their POMs immediately

I do not think this is ever going to happen. Having researched a bit, I found no trace of SPDX license IDs being intended to be used in that way. My understanding is that the IDs are primarily meant to be used in source files, rather than annotate whole software packages. See e.g. https://spdx.org/ids .

The SPDX Document is the concept intended for software packages (such as Java JARs): https://spdx.org/using-spdx-documents . I think these might be a better bet for us, especially because there is a Maven plugin by @goneall to produce them: https://github.com/spdx/spdx-maven-plugin

Here is an example of an SPDX doc: http://central.maven.org/maven2/org/spdx/spdx-tools/2.1.16/spdx-tools-2.1.16.spdx

The funny thing is that SPDX docs do not seem to provide any aggregated license info (is this right, @goneall ?) They just seem to list licenses of all source files and the aggregation would have to be done by us. By aggregation, I mean collecting distinct license IDs occurring in <hasFile> items. This should typically yield a single SPDX license ID. If there are more, I do not see a problem for blacklisting.

Anyway, even if SPDX docs can serve our purpose, we have to figure out how to handle the tons of legacy artifacts, that do not have any associated SPDX doc. I still consider matching the license file text against the SPDX database to be the most reliable strategy.
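A rough sketch of what matching a license file against reference texts could look like, in Java. Everything here is an assumption for illustration: the class and method names are invented, the reference texts are passed in as a plain map rather than read from the SPDX license list, and the normalization (drop lines starting with "Copyright", collapse whitespace, ignore case) is far cruder than the SPDX matching guidelines. It only shows how the artifact-specific copyright line mentioned earlier in the thread could be neutralized before comparing.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: match a downloaded license text against reference texts. */
final class LicenseTextMatchSketch {

    /** Drops copyright lines, collapses whitespace and lower-cases the text. */
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith("copyright")) {
                continue; // artifact-specific, so ignored for matching
            }
            sb.append(trimmed).append(' ');
        }
        return sb.toString().replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    /** Returns the identifier of the first reference text equal after normalization. */
    static Optional<String> match(String licenseText, Map<String, String> referenceTexts) {
        String normalized = normalize(licenseText);
        for (Map.Entry<String, String> reference : referenceTexts.entrySet()) {
            if (normalize(reference.getValue()).equals(normalized)) {
                return Optional.of(reference.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Abbreviated placeholder text, not a complete license.
        Map<String, String> references =
                Map.of("MIT", "Copyright (c) <year> <holder>\n\nPermission is hereby granted, free of charge");
        String downloaded = "Copyright (c) 2019 Jane Doe\n\nPermission is  hereby granted,\nfree of charge";
        System.out.println(match(downloaded, references));
    }
}
```

An exact comparison after normalization is deliberately strict: a text that does not match yields an empty result, which a mojo could treat as "needs manual review" instead of guessing.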

I think we should start implementing this as a separate mojo that takes licenses.xml and downloaded license files as an input. The download-licenses mojo is complicated enough. A separate mojo will be easier to test.

ppalaga avatar May 13 '19 14:05 ppalaga

@ppalaga Currently, the SPDX maven plugin can aggregate licenses, but it requires configuring the licenses in the POM file for the project, which is admittedly a very tedious effort if you have a lot of different licenses.

I have avoided doing any kind of source scanning in the plugin due to the compute time and inaccuracy for most scanners.

After reading this thread, however, I think it would be a very straightforward approach to add a scan for SPDX license identifiers. Parsing for the license ID declarations would be fast and accurate. I'll add an issue on the SPDX maven plugin to track.
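For illustration, scanning for license ID declarations could be as simple as the following Java sketch. It is not code from the SPDX maven plugin; the class and method names are invented here. It looks for the `SPDX-License-Identifier:` tag that source files use to declare their license and collects the distinct expressions found.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: collect SPDX-License-Identifier declarations from source text. */
final class SpdxTagScanSketch {

    // Captures everything after the tag up to the end of the line.
    private static final Pattern TAG =
            Pattern.compile("SPDX-License-Identifier:\\s*([^\\r\\n]+)");

    /** Returns the distinct license expressions declared in the given source text. */
    static Set<String> findDeclaredExpressions(String source) {
        Set<String> found = new TreeSet<>();
        Matcher matcher = TAG.matcher(source);
        while (matcher.find()) {
            String expression = matcher.group(1).trim();
            // Strip a trailing comment terminator such as "*/" or "-->".
            expression = expression.replaceAll("\\s*(\\*/|-->)\\s*$", "");
            found.add(expression);
        }
        return found;
    }

    public static void main(String[] args) {
        String javaFile = "// SPDX-License-Identifier: Apache-2.0\nclass A {}\n";
        System.out.println(findDeclaredExpressions(javaFile));
    }
}
```

Because this is a plain text search per file, it avoids the cost and inaccuracy of heuristic license scanners, at the price of finding nothing in files that carry no tag.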

One other note on the SPDX maven plugin. It handles dependencies by retrieving the POM files for the dependencies and including them in the SPDX document. If the dependency uses the SPDX maven plugin, you will get high fidelity information on the dependent library licenses. If the dependency does not use the SPDX maven plugin, it will do a best effort to determine the license and include some information based on what is available in the Maven POM file metadata.

goneall avatar May 13 '19 17:05 goneall

@goneall thanks for the details about the SPDX maven plugin.

There is still one thing unclear to me: given an SPDX document like this one: http://central.maven.org/maven2/org/spdx/spdx-tools/2.1.16/spdx-tools-2.1.16.spdx , is there any single place in the document saying that the artifact as a whole is Apache 2.0 licensed?

Or do I really have to loop over all hasFile (or other) elements and figure out myself that all of them have Apache 2.0 assigned?

ppalaga avatar May 17 '19 09:05 ppalaga

There is an spdx:licenseConcluded and spdx:licenseDeclared associated with the SPDX package itself. The licenseDeclared is what was found in the Maven POM file for the originating package while the licenseConcluded includes the declared license plus any licenses found in the dependencies.

The licenseDeclared from the spdx-tools-2.1.16.spdx file: Apache-2.0

The licenseConcluded from the spdx-tools-2.1.16.spdx file: (MPL-1.0 AND MIT AND LicenseRef-CyberNeko AND LGPL-2.1 AND X11 AND BSD-3-Clause AND Apache-2.0)

Note that a LicenseRef-... identifier refers to license text found elsewhere in the SPDX document, whereas any identifier without the LicenseRef- prefix refers to an SPDX listed license.
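As a sketch of how a blacklisting mojo might consume such a licenseConcluded value, the Java below splits the expression into identifiers and checks them against a blacklist. The class and method names are hypothetical, and the handling is deliberately conservative and simplified: any blacklisted identifier mentioned anywhere in the expression counts as a hit, regardless of whether it is joined with AND or OR, and the exception identifier following a WITH operator is skipped.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch: extract license identifiers from an SPDX license expression. */
final class ConcludedLicenseSketch {

    /** Returns the license identifiers mentioned in the expression, in order of appearance. */
    static Set<String> licenseIds(String expression) {
        Set<String> ids = new LinkedHashSet<>();
        boolean skipNext = false;
        for (String token : expression.split("[\\s()]+")) {
            if (token.isEmpty()) {
                continue; // produced by a leading parenthesis
            }
            if (skipNext) {
                skipNext = false; // exception identifier after WITH, not a license
                continue;
            }
            if (token.equals("AND") || token.equals("OR")) {
                continue;
            }
            if (token.equals("WITH")) {
                skipNext = true;
                continue;
            }
            ids.add(token);
        }
        return ids;
    }

    /** Returns the identifiers from the expression that are on the blacklist. */
    static Set<String> blacklisted(String expression, Set<String> blacklist) {
        Set<String> hits = licenseIds(expression);
        hits.retainAll(blacklist);
        return hits;
    }

    /** True for identifiers that point at license text inside the SPDX document. */
    static boolean isCustom(String id) {
        return id.startsWith("LicenseRef-");
    }

    public static void main(String[] args) {
        String concluded = "(MPL-1.0 AND MIT AND LicenseRef-CyberNeko AND LGPL-2.1 "
                + "AND X11 AND BSD-3-Clause AND Apache-2.0)";
        System.out.println(licenseIds(concluded));
        System.out.println(blacklisted(concluded, Set.of("LGPL-2.1")));
    }
}
```

Identifiers for which `isCustom` returns true cannot be matched against a list of SPDX identifiers at all, so they would need either manual review or a user-defined entry.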

goneall avatar May 17 '19 17:05 goneall

Thanks for the clarifications, @goneall !

ppalaga avatar May 20 '19 13:05 ppalaga

@ppalaga Sorry for not getting back to this. Thanks for explaining, I misunderstood how SPDX is supposed to work.

My hope would be that the license plugin could get the SPDX document associated with an artifact (not sure if SPDX has a standard for how to publish these on Maven, e.g. with a classifier?). If an artifact already has an SPDX document, we can look at that to derive the list of licenses for an artifact, as you mention.

If an artifact doesn't have an SPDX document associated (as most probably won't), it would be good if users could manually specify the licenses applying to an artifact. Regex matching sounds good as a strategy, but it's nice to be able to override or specify explicitly in case there isn't a match.

srdo avatar Jun 16 '19 07:06 srdo

not sure if SPDX has a standard for how to publish these on Maven, e.g. with a classifier?

I recently discussed this with @goneall : https://github.com/spdx/spdx-maven-plugin/issues/16

If an artifact doesn't have an SPDX document associated (as most probably won't), it would be good if users could manually specify the licenses applying to an artifact.

I think the existing options of DownloadLicensesMojo already cover all we need. The output of the mojo is a licenses.xml file (as complete and manually enhanced as the user wishes) and the downloaded licenses.

ppalaga avatar Jun 19 '19 12:06 ppalaga