cyclonedx-maven-plugin

Copyright detection would be amazing

Open ben-spiller opened this issue 1 year ago • 8 comments

Since many open-source licenses (e.g. MIT) require publishing a list of copyright attributions from the dependencies you use, it'd be awesome to have support for detecting copyrights in this tool to populate the CycloneDX "copyright" field and comply with this common requirement.

This could be implemented by using a regex (user-configurable would be great) to detect copyright messages from various standard locations inside the jar (a configurable set of globs) e.g. NOTICES, META-INF/MANIFEST.MF, README.* etc.

Even more amazing would be to download the associated source jar from Maven Central in case the binary doesn't contain copyrights (but even just binary scanning would be a big win).
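
To make the regex-over-globs idea concrete, here is a minimal sketch, not an existing feature of the plug-in. The class name `CopyrightScanner`, the default regex, and the list of entry-name patterns (standing in for the configurable globs) are all assumptions chosen for illustration; a real implementation would expose them as plug-in configuration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of cyclonedx-maven-plugin.
public class CopyrightScanner {

    // Assumed default: "Copyright", an optional (c) or ©, a four-digit year, then the rest of the line.
    static final Pattern COPYRIGHT = Pattern.compile(
            "copyright\\s+(?:\\(c\\)\\s*|©\\s*)?\\d{4}.*", Pattern.CASE_INSENSITIVE);

    // Assumed "standard locations"; these regexes stand in for user-configurable globs.
    static final List<Pattern> LOCATIONS = List.of(
            Pattern.compile("(?i)(?:META-INF/)?NOTICES?(?:\\.\\w+)?"),
            Pattern.compile("(?i)(?:META-INF/)?LICENSE(?:\\.\\w+)?"),
            Pattern.compile("(?i)META-INF/MANIFEST\\.MF"),
            Pattern.compile("(?i)README(?:\\.\\w+)?"));

    // True if the jar entry name is one of the locations worth scanning.
    static boolean isCandidate(String entryName) {
        for (Pattern p : LOCATIONS) {
            if (p.matcher(entryName).matches()) {
                return true;
            }
        }
        return false;
    }

    // Adds the copyright statement found on a single line, if any, to the result set.
    static void collect(String line, Set<String> found) {
        Matcher m = COPYRIGHT.matcher(line);
        if (m.find()) {
            found.add(m.group().trim());
        }
    }

    // Extracts copyright statements from a block of text, line by line, in order of appearance.
    static Set<String> extract(String text) {
        Set<String> found = new LinkedHashSet<>();
        for (String line : text.split("\\R")) {
            collect(line, found);
        }
        return found;
    }

    // Scans only the candidate entries of a jar instead of every file in it.
    static Set<String> scanJar(Path jar) throws IOException {
        Set<String> found = new LinkedHashSet<>();
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            Enumeration<JarEntry> entries = jarFile.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.isDirectory() || !isCandidate(entry.getName())) {
                    continue;
                }
                try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                        jarFile.getInputStream(entry), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        collect(line, found);
                    }
                }
            }
        }
        return found;
    }
}
```

Calling `CopyrightScanner.scanJar(path)` on a binary jar, and on the matching source jar if one is available, would give the raw statements to put into the "copyright" field. A single-line regex like this will miss notices that wrap across lines, so it only covers the obvious cases.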

ben-spiller avatar Aug 23 '23 16:08 ben-spiller

I'm also very much interested in this. I found https://github.com/JD-CSTx/license-maven-plugin which does exactly what is needed here for a different Maven plug-in. If @JD-CSTx agrees I would volunteer to take his code and try to add it to the cyclonedx-maven-plugin.

sithmein avatar Sep 11 '23 14:09 sithmein

I'm also very much interested in this. I found https://github.com/JD-CSTx/license-maven-plugin which does exactly what is needed here for a different Maven plug-in. If @JD-CSTx agrees I would volunteer to take his code and try to add it to the cyclonedx-maven-plugin.

Of course I agree. Besides, I couldn't disagree even if I wanted to: it's a fork of the MojoHaus License Maven Plugin (which was abandoned for a long time) and is under the LGPL 3.0 license: https://www.mojohaus.org/license-maven-plugin/licenses.html.

Master-Code-Programmer avatar Sep 16 '23 11:09 Master-Code-Programmer

I started working on this at https://github.com/sithmein/cyclonedx-maven-plugin/tree/issue-389-copyright-detection . The Maven plug-in has a new configuration parameter extractCopyrights which is false by default. If set to true the plug-in will look into all artifacts' Jar files (binaries and sources) and extract copyright information. I tested it with a project of ours that has ~300 components and the plug-in is able to extract almost all copyright information that I was able to find manually.

This is only a first iteration but you can already give it a try by installing it locally (I bumped the version) and then running the new version on a project.

One open question is about the format when there are multiple copyrights found. CycloneDX only has a text field for copyright. The plug-in currently joins all found copyrights with a semicolon.
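
For illustration, the joining step could look roughly like the sketch below. The class and method names are hypothetical and not taken from the branch; trimming, dropping duplicates, and the space after the semicolon are assumptions on top of the plain semicolon join described above.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not the code from the branch.
public class CopyrightJoiner {

    // Collapses several copyright statements into the single text value CycloneDX allows.
    static String join(List<String> copyrights) {
        // LinkedHashSet drops duplicates while keeping the order in which statements were found.
        Set<String> unique = new LinkedHashSet<>();
        for (String copyright : copyrights) {
            String trimmed = copyright.trim();
            if (!trimmed.isEmpty()) {
                unique.add(trimmed);
            }
        }
        return String.join("; ", unique);
    }
}
```

One caveat with this format: a copyright statement that itself contains a semicolon can no longer be split back out reliably, which is part of why the format is still an open question.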

sithmein avatar Oct 09 '23 13:10 sithmein

You may want to check out ScanCode toolkit (which I co-maintain) for this. It is considered one of the best-in-class tools for copyright detection. It is written in Python, not Java, though. https://github.com/nexB/scancode-toolkit/tree/develop/src/cluecode

pombredanne avatar Oct 09 '23 19:10 pombredanne

I already tried it but the results were not really satisfactory. It reported quite a lot of nonsense in our case. And it took waaaay too much time, likely because it looked at each and every file. I don't believe this is necessary, though. If the publisher of an artifact doesn't bother providing copyright information in some usable way, you cannot expect users of that artifact to dig it up themselves by looking at every single file. My - totally non-legal - opinion.

sithmein avatar Oct 09 '23 19:10 sithmein

@sithmein re:

I already tried it but the results were not really satisfactory. It reported quite a lot of nonsense in our case

That's a bug to me then. Do you have the input you used?

pombredanne avatar Oct 10 '23 17:10 pombredanne

This can't be implemented using regex. Been there and bought the T-shirt. Use a project like JavaParser and parse the comment nodes from the AST for Java. For other files, find a suitable tree-sitter implementation.

prabhu avatar Oct 13 '23 08:10 prabhu

What do you mean by "it can't be implemented"? Obviously it works. It may not detect every kind of weird copyright notice, but I doubt that any other approach will. The question is: what do we want to achieve in the end? My goal right now is to extract copyright information that is provided in an obvious and clean way. The goal is not to reimplement ScanCode in Java (as an example), not least because we are not working on the sources but on the official (binary) artifacts.

sithmein avatar Oct 13 '23 08:10 sithmein