
Consider detected copyrights when determining a declared holder from a package manifest in summary plugin

Open JonoYang opened this issue 2 years ago • 6 comments

When scanning the package atheris v 2.0.11 (https://github.com/google/atheris/archive/refs/tags/2.0.11.tar.gz) using the --summary plugin, the declared_holder value in the scan summary is Bitshift, which is the author of the package. This was determined from the parsed package data from the setup.py file of atheris. However, the setup.py contains a comment that is a copyright statement with the actual copyright holders. The summary plugin should be updated to also consider copyrights detected by the copyright scanner. This value should take precedence over authors.
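The precedence described above could be sketched roughly as follows. This is only an illustration, not the summary plugin's actual code: `pick_declared_holder` and its two list arguments are hypothetical names, and the holder value `Google LLC` is used purely as an example input.

```python
def pick_declared_holder(detected_holders, package_parties):
    """Return a declared holder, preferring detected copyright holders.

    Both arguments are hypothetical plain lists of strings; they do not
    mirror ScanCode's real data model.
    """
    # Holders found by the copyright scanner take precedence.
    for holder in detected_holders:
        if holder and holder.strip():
            return holder.strip()

    # Fall back to manifest authors only when no holder was detected.
    for party in package_parties:
        if party and party.strip():
            return party.strip()

    return None


# The manifest author no longer wins when a copyright holder was detected.
print(pick_declared_holder(["Google LLC"], ["Bitshift"]))
```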

JonoYang, May 19 '22 19:05

It also may not behoove us to use the package authors as a copyright holder when we do not detect an explicit copyright statement from package data.

JonoYang, May 19 '22 19:05

@DennisClark @tdruez @pombredanne

When removing the code that assigns the author or other detected parties from a Package as the declared holder, I noticed that the tallies plugin normalizes the holders detected from Resources in the codebase. The majority of the files have Google LLC as the copyright holder, but in the summary only Google, Inc. shows up as the declared holder. This is done so that we can group different forms of the same holder together. For example, at https://github.com/nexB/scancode-toolkit/blob/2972-summary-consider-copyrights/src/summarycode/copyright_tallies.py#L487, we normalize google, google llc, and google inc to Google, Inc.

Should we remove this normalization of holders to a canonical form? Normalizing and grouping the related holders together helps with getting a good count of how many times a particular holder shows up, especially when there are many different forms of copyright statements for that holder. However, it can become confusing when someone wants to verify the summary results and they cannot find the declared holder in files because the detected holder value was changed.

JonoYang, May 19 '22 23:05

@JonoYang I am not convinced that using an author value for Holder when there is no copyright detected is a good thing, although I don't feel strongly about it. However, I vaguely recall some community discussion on this topic, where someone strongly asserted that author is NOT equivalent to copyright, so there is definitely a case for not using it at all for a Holder.

As far as "normalizing" the holder goes, it is a nice feature if we can still point back to the original somehow.

DennisClark, May 19 '22 23:05

@DennisClark

> @JonoYang I am not convinced that using an author value for Holder when there is no copyright detected is a good thing, although I don't feel strongly about it. However, I vaguely recall some community discussion on this topic, where someone strongly asserted that author is NOT equivalent to copyright, so there is definitely a case for not using it at all for a Holder.

I've removed the code that uses the Package authors/maintainers as a holder when no copyright is detected.

> As far as "normalizing" the holder goes, it is a nice feature if we can still point back to the original somehow.

Maybe we can have a list of the original holder values when we present the tallies of holders?

    ...
    "declared_holder": {
      "holder": "Google, Inc.",
      "holder_forms": [
        "Google LLC",
        "Google, Inc."
      ]
    },
    "other_holders": [
      {
        "value": "Fraunhofer FKIE",
        "holder_forms": [
          "Fraunhofer FKIE"
        ],
        "count": 21
      }
    ],
    ...

I'm not sure what the best name for that field would be.
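A rough sketch of how such a tally could keep the original forms next to the normalized value is below. It is not the code in `copyright_tallies.py`: the `CANONICAL_HOLDERS` mapping and the `tally_holders` function are hypothetical, and `holder_forms` is just the placeholder field name from the example above.

```python
from collections import defaultdict

# Hypothetical mapping from a lowercased holder form to its canonical name.
CANONICAL_HOLDERS = {
    "google": "Google, Inc.",
    "google llc": "Google, Inc.",
    "google inc": "Google, Inc.",
    "google, inc.": "Google, Inc.",
}


def tally_holders(detected_holders):
    """Count holders by canonical name and remember every original form."""
    counts = defaultdict(int)
    forms = defaultdict(set)

    for holder in detected_holders:
        # Unknown holders are kept unchanged as their own canonical value.
        canonical = CANONICAL_HOLDERS.get(holder.lower(), holder)
        counts[canonical] += 1
        forms[canonical].add(holder)

    tallies = [
        {"value": value, "holder_forms": sorted(forms[value]), "count": count}
        for value, count in counts.items()
    ]
    # Most frequent holder first; ties are broken by name.
    tallies.sort(key=lambda tally: (-tally["count"], tally["value"]))
    return tallies


for tally in tally_holders(["Google LLC", "Google, Inc.", "Google LLC", "Fraunhofer FKIE"]):
    print(tally)
```

With the forms listed this way, someone verifying the summary can search the files for any entry in `holder_forms` instead of the normalized value.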

JonoYang, May 20 '22 00:05

After discussion with @pombredanne, it would make sense to just use the company/organization name itself without any of the suffixes. Google, Inc., Google LLC, etc. should just become Google.
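As a minimal sketch of that idea, assuming a small hand-picked list of legal suffixes (the list, the regex, and the `strip_legal_suffix` name are all illustrative, not what ScanCode ships):

```python
import re

# Hypothetical, non-exhaustive list of legal-entity suffixes.
_SUFFIXES = (
    "incorporated", "inc", "llc", "ltd", "limited",
    "corporation", "corp", "gmbh",
)

# Match a trailing suffix preceded by whitespace and/or a comma,
# optionally followed by a period.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:" + "|".join(_SUFFIXES) + r")\.?\s*$",
    re.IGNORECASE,
)


def strip_legal_suffix(holder):
    """Return the holder name without one trailing legal suffix."""
    name = holder.strip()
    stripped = _SUFFIX_RE.sub("", name)
    # Never return an empty string for a name that is only a suffix.
    return stripped or name


for form in ("Google, Inc.", "Google LLC", "Google", "Fraunhofer FKIE"):
    print(form, "->", strip_legal_suffix(form))
```

This only covers forms that differ by a trailing suffix; it does nothing for holders whose relationship is not visible in their names.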

JonoYang, May 24 '22 19:05

That does make sense for this case, but this Google example seems to be a relatively easy one. There will be many other cases where the relationship among holders is not evident in the names. There is really no way for us to figure this out from a set of copyright holder names beyond these simple cases. What would be interesting is to know the holder best associated with the primary license.

mjherzog, May 24 '22 19:05