scancode-toolkit
Consider detected copyrights when determining a declared holder from a package manifest in summary plugin
When scanning the package atheris v2.0.11 (https://github.com/google/atheris/archive/refs/tags/2.0.11.tar.gz) using the --summary plugin, the declared_holder value in the scan summary is Bitshift, which is the author of the package. This was determined from the parsed package data from the setup.py file of atheris. However, the setup.py contains a comment that is a copyright statement with the actual copyright holders. The summary plugin should be updated to also consider copyrights detected by the copyright scanner. This value should take precedence over authors.
It also may not behoove us to use the package authors as a copyright holder when we do not detect an explicit copyright statement from package data.
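To make the proposed precedence concrete, here is a minimal sketch. It is not the summary plugin's actual code: `pick_declared_holder`, its parameters, and the `use_authors_as_fallback` switch are hypothetical names used only for illustration. Detected holders win when there are any, and authors are used only if the caller explicitly opts in.

```python
from collections import Counter


def pick_declared_holder(detected_holders, package_authors, use_authors_as_fallback=False):
    """Pick a declared holder, preferring holders found by copyright detection.

    Hypothetical helper: detected_holders and package_authors are plain
    lists of strings, not scancode objects.
    """
    # Count each non-empty detected holder and prefer the most frequent one.
    counts = Counter(holder for holder in detected_holders if holder)
    if counts:
        return counts.most_common(1)[0][0]

    # No copyright statement was detected: only use an author if the caller
    # explicitly asks for it, since an author is not necessarily a holder.
    if use_authors_as_fallback and package_authors:
        return package_authors[0]

    return None
```

With the fallback left off, a package that has an author but no detected copyright statement ends up with no declared holder at all, which is the behavior questioned in the paragraph above.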
@DennisClark @tdruez @pombredanne
When removing the code that assigns the author or other detected parties from a Package as the declared holder, I noticed that the tallies plugin normalizes the holders detected in the codebase's Resources. The majority of the files have "Google LLC" as the copyright holder, but looking at the summary, only "Google, Inc." shows up as the declared holder. This is done so we are able to group different forms of the same holder together. For example, from https://github.com/nexB/scancode-toolkit/blob/2972-summary-consider-copyrights/src/summarycode/copyright_tallies.py#L487, we normalize "google", "google llc", and "google inc" as "Google, Inc.".
Should we remove this normalization of holders to a canonical form? Normalizing and grouping the related holders together helps with getting a good count of how many times a particular holder shows up, especially when there are many different forms of copyright statements for that holder. However, it can become confusing when someone wants to verify the summary results and they cannot find the declared holder in files because the detected holder value was changed.
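For readers unfamiliar with this kind of grouping, a rough sketch of the idea follows. It does not reproduce the code in copyright_tallies.py: `CANONICAL_HOLDERS`, `holder_key`, and `tally_holders` are hypothetical names, and the lookup table contains only the three Google forms mentioned above.

```python
from collections import Counter

# Hypothetical lookup table; the entries mirror the Google example above.
CANONICAL_HOLDERS = {
    "google": "Google, Inc.",
    "google llc": "Google, Inc.",
    "google inc": "Google, Inc.",
}


def holder_key(holder):
    """Build a lookup key: lowercase, drop commas and periods, collapse spaces."""
    cleaned = holder.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def tally_holders(holders):
    """Count holders after mapping known variants to one canonical form."""
    counts = Counter()
    for holder in holders:
        # Unknown holders are kept exactly as detected.
        canonical = CANONICAL_HOLDERS.get(holder_key(holder), holder)
        counts[canonical] += 1
    return counts
```

This shows both sides of the trade-off: the counts for one holder are merged, but the tally no longer contains the "Google LLC" string that actually appears in the files.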
@JonoYang I am not convinced that using an author value for Holder when there is no copyright detected is a good thing, although I don't feel strongly about it. However, I vaguely recall some community discussion on this topic, where someone strongly asserted that author is NOT equivalent to copyright, so there is definitely a case for not using it at all for a Holder.
As far as "normalizing" the holder goes, it is a nice feature if we can still point back to the original somehow.
@DennisClark
> @JonoYang I am not convinced that using an author value for Holder when there is no copyright detected is a good thing, although I don't feel strongly about it. However, I vaguely recall some community discussion on this topic, where someone strongly asserted that author is NOT equivalent to copyright, so there is definitely a case for not using it at all for a Holder.
I've removed the code that uses the Package authors/maintainers as a holder when no copyright is detected.
> As far as "normalizing" the holder goes, it is a nice feature if we can still point back to the original somehow.
Maybe we can have a list of the original holder values when we present the tallies of holders?
...
"declared_holder": {
    "holder": "Google, Inc.",
    "holder_forms": [
        "Google LLC",
        "Google, Inc."
    ]
},
"other_holders": [
    {
        "value": "Fraunhofer FKIE",
        "holder_forms": [
            "Fraunhofer FKIE"
        ],
        "count": 21
    }
],
...
I'm not sure what the best name for that field would be.
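One way such entries could be built is sketched below. This is only an illustration of the proposed output shape: `tally_with_forms` is a hypothetical function, and `normalize` stands in for whatever normalization the tallies code applies.

```python
def tally_with_forms(holders, normalize):
    """Group holders by normalized value while remembering the original forms.

    Hypothetical helper: `normalize` is any callable that maps a detected
    holder string to its canonical form.
    """
    groups = {}
    for holder in holders:
        value = normalize(holder)
        entry = groups.setdefault(
            value, {"value": value, "holder_forms": [], "count": 0}
        )
        # Record each distinct original spelling once, in order of appearance.
        if holder not in entry["holder_forms"]:
            entry["holder_forms"].append(holder)
        entry["count"] += 1

    # Most frequent holder first, as in a tally.
    return sorted(groups.values(), key=lambda entry: entry["count"], reverse=True)
```

Anyone verifying the summary could then search the codebase for any string in "holder_forms" rather than for the normalized value.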
After discussion with @pombredanne, it would make sense to just use the company/organization name itself without any of the suffixes. "Google, Inc.", "Google LLC", etc. should just become "Google".
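A minimal sketch of suffix stripping, assuming a small hand-picked suffix list: `strip_legal_suffix` and the suffixes in the pattern are hypothetical and not taken from the scancode code. A real list would need to be much longer.

```python
import re

# Hypothetical, deliberately short list of legal-entity suffixes.
# The suffix must follow whitespace or a comma and end the string.
_SUFFIX_RE = re.compile(r"[\s,]+(?:inc|llc|ltd|corp|gmbh)\.?$", re.IGNORECASE)


def strip_legal_suffix(holder):
    """Return the holder name without one trailing legal-entity suffix."""
    name = holder.strip()
    stripped = _SUFFIX_RE.sub("", name)
    # Keep the original if stripping would leave nothing.
    return stripped or name
```

Names without a recognized suffix, such as "Fraunhofer FKIE", pass through unchanged, so this only covers the simple cases.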
That does make sense for this case, but this Google example seems to be a relatively easy one. There will be many other cases where the relationship among holders is not evident in the names. There is really no way for us to figure this out from a set of copyright holder names beyond these simple cases. What would be interesting is to know the holder best associated with the primary license.