
Many duplicates in SPDX files

Open vargenau opened this issue 3 years ago • 11 comments

Description

In the SPDX code, we have multiple times the same code, for example:

LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>

or

# File

FileName: ./tern.original/LICENSE.txt
SPDXID: SPDXRef-7
FileChecksum: SHA1: 5ec0910f78578a5df32b56cae953249d45d0dd5b
LicenseConcluded: NOASSERTION
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: OFL-1.1
FileCopyrightText: <text>Copyright (c) 2017 VMware, Inc.
</text>

I do not know if it is really a bug, but it is at least confusing.
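
As a stopgap on the consumer side, the repeated LicenseInfoInFile lines can be collapsed after the scan. The following is a minimal sketch, not something ScanCode does itself: the function name `dedupe_license_info` is hypothetical, and it assumes each file entry in the tag-value output starts with a `FileName:` line.

```python
def dedupe_license_info(spdx_text: str) -> str:
    """Drop repeated LicenseInfoInFile lines within each file entry."""
    seen = set()
    out = []
    for line in spdx_text.splitlines():
        if line.startswith("FileName:"):
            # A new file entry starts: forget licenses seen in the previous one.
            seen = set()
        if line.startswith("LicenseInfoInFile:"):
            value = line.split(":", 1)[1].strip()
            if value in seen:
                continue  # already listed for this file entry
            seen.add(value)
        out.append(line)
    return "\n".join(out)


sample = """FileName: ./tern.original/LICENSE.txt
SPDXID: SPDXRef-7
LicenseConcluded: NOASSERTION
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: OFL-1.1"""

print(dedupe_license_info(sample))
```

The sketch works line by line, so it would also touch a line starting with `LicenseInfoInFile:` inside a multi-line `<text>` block; a real SPDX parser would be safer for anything beyond a quick check.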

reuse.spdx.txt tern.spdx.txt

How To Reproduce

scancode -c -l -i --spdx-tv tern.spdx /home/vargenau/git/tern.original/
scancode -c -l -i --spdx-tv reuse.spdx /home/vargenau/git/reuse/

where the code comes from GitHub:

https://github.com/tern-tools/tern
https://github.com/fsfe/reuse-tool

System configuration

  • What OS are you running on? Ubuntu 21.10
  • What version of scancode-toolkit was used to generate the scan file? ScanCode version 30.1.0
  • What installation method was used to install/run scancode? pip

vargenau avatar Mar 31 '22 17:03 vargenau

@vargenau Thank you for the report!

You wrote:

In the SPDX code, we have multiple times the same code, for example: ...

I assume here (maybe incorrectly) that when you say "we have multiple times the same code" you mean that the "See details ..." comment is showing up twice?

It is always best to run with --license-text when producing SPDX output, so that the extracted text is populated. Otherwise, you get a boilerplate ExtractedText: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml </text>

pombredanne avatar Mar 31 '22 18:03 pombredanne

I ran a single scan on https://raw.githubusercontent.com/tern-tools/tern/cdc6732eda7de1e5e1f9e1298a6db2e073ec48fc/LICENSE.txt

Of note:

  1. the text is damaged with mojibake, which makes matching a bit less accurate:
$ chardet3 LICENSE.txt 
LICENSE.txt: windows-1252 with confidence 0.73
$ file LICENSE.txt 
LICENSE.txt: Non-ISO extended-ASCII text, with very long lines, with CRLF line terminators
  2. the license detection could be better and return only two matches rather than three
  3. we are working on refinements to eventually merge multiple matches in a single detection
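
The encoding problem in the first point can be illustrated with a short sketch. It assumes the file really is windows-1252, as chardet guessed with 0.73 confidence; the helper name `normalize_license_bytes` and the replacement table are hypothetical and are not how ScanCode handles such files.

```python
# Windows-1252 "smart" punctuation mapped to plain ASCII equivalents.
ASCII_EQUIVALENTS = {
    "\u201c": '"',  # left double quotation mark (byte 0x93)
    "\u201d": '"',  # right double quotation mark (byte 0x94)
    "\u2018": "'",  # left single quotation mark (byte 0x91)
    "\u2019": "'",  # right single quotation mark (byte 0x92)
    "\u2022": "*",  # bullet (byte 0x95)
}


def normalize_license_bytes(raw: bytes) -> str:
    """Decode windows-1252 bytes and replace smart punctuation with ASCII."""
    # Bytes that are undefined in cp1252 become U+FFFD instead of raising.
    text = raw.decode("cp1252", errors="replace")
    for smart, plain in ASCII_EQUIVALENTS.items():
        text = text.replace(smart, plain)
    # Use LF line endings instead of CRLF.
    return text.replace("\r\n", "\n")


raw = b"The BSD-2 license (the \x93License\x94) set forth below\r\n\x95\tRedistributions"
print(normalize_license_bytes(raw))
```

Bytes 0x93, 0x94 and 0x95 are the ones that show up as `\x93`, `\x94` and `\x95` in the matched_text values of the scan output.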

NB: If you are interested in container scans, also check out the companion server project http://scancode.io/

headers:
    -   tool_name: scancode-toolkit
        tool_version: 31.0.0b1
        options:
            input:
                - LICENSE.txt
            --license: yes
            --license-text: yes
            --license-text-diagnostics: yes
            --yaml: '-'
        notice: |
            Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
            OR CONDITIONS OF ANY KIND, either express or implied. No content created from
            ScanCode should be considered or used as legal advice. Consult an Attorney
            for any legal advice.
            ScanCode is a free software code scanning tool from nexB Inc. and others.
            Visit https://github.com/nexB/scancode-toolkit/ for support and download.
        start_timestamp: '2022-03-31T185132.281914'
        end_timestamp: '2022-03-31T185135.025805'
        output_format_version: 2.0.0
        duration: '2.743912696838379'
        message:
        errors: []
        extra_data:
            spdx_license_list_version: '3.16'
            files_count: 1
files:
    -   path: LICENSE.txt
        type: file
        licenses:
            -   key: bsd-simplified
                score: '100.0'
                name: BSD-2-Clause
                short_name: BSD-2-Clause
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Regents of the University of California
                homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
                text_url: http://opensource.org/licenses/bsd-license.php
                reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
                spdx_license_key: BSD-2-Clause
                spdx_url: https://spdx.org/licenses/BSD-2-Clause
                start_line: 5
                end_line: 5
                matched_rule:
                    identifier: bsd-simplified_226.RULE
                    license_expression: bsd-simplified
                    licenses:
                        - bsd-simplified
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: yes
                    is_license_reference: no
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 13
                    matched_length: 13
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: "The BSD-2 license (the \x93License\x94) set forth below applies\
                    \ to all parts"
            -   key: bsd-simplified
                score: '50.0'
                name: BSD-2-Clause
                short_name: BSD-2-Clause
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Regents of the University of California
                homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
                text_url: http://opensource.org/licenses/bsd-license.php
                reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
                spdx_license_key: BSD-2-Clause
                spdx_url: https://spdx.org/licenses/BSD-2-Clause
                start_line: 5
                end_line: 5
                matched_rule:
                    identifier: bsd-simplified_275.RULE
                    license_expression: bsd-simplified
                    licenses:
                        - bsd-simplified
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: yes
                    is_license_reference: no
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 3-seq
                    rule_length: 28
                    matched_length: 14
                    match_coverage: '50.0'
                    rule_relevance: 100
                matched_text: "License\x94) [set] [forth] [below] [applies] [to] [all] [parts]\
                    \ [of] [the] [Tern] project.  You may not use this file except in compliance\
                    \ with the License."
            -   key: bsd-simplified
                score: '100.0'
                name: BSD-2-Clause
                short_name: BSD-2-Clause
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Regents of the University of California
                homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
                text_url: http://opensource.org/licenses/bsd-license.php
                reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
                spdx_license_key: BSD-2-Clause
                spdx_url: https://spdx.org/licenses/BSD-2-Clause
                start_line: 5
                end_line: 7
                matched_rule:
                    identifier: bsd-simplified_53.RULE
                    license_expression: bsd-simplified
                    licenses:
                        - bsd-simplified
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 3
                    matched_length: 3
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: "License. \r\n\r\nBSD-2"
            -   key: bsd-simplified
                score: '100.0'
                name: BSD-2-Clause
                short_name: BSD-2-Clause
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Regents of the University of California
                homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
                text_url: http://opensource.org/licenses/bsd-license.php
                reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
                spdx_license_key: BSD-2-Clause
                spdx_url: https://spdx.org/licenses/BSD-2-Clause
                start_line: 7
                end_line: 12
                matched_rule:
                    identifier: bsd-simplified_169.RULE
                    license_expression: bsd-simplified
                    licenses:
                        - bsd-simplified
                    referenced_filenames: []
                    is_license_text: yes
                    is_license_notice: no
                    is_license_reference: no
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: no
                    matcher: 2-aho
                    rule_length: 184
                    matched_length: 184
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: "License \r\n\r\nRedistribution and use in source and binary forms,\
                    \ with or without modification, are permitted provided that the following\
                    \ conditions are met:\r\n\x95\tRedistributions of source code must retain\
                    \ the above copyright notice, this list of conditions and the following\
                    \ disclaimer.\r\n\x95\tRedistributions in binary form must reproduce the\
                    \ above copyright notice, this list of conditions and the following disclaimer\
                    \ in the documentation and/or other materials provided with the distribution.\r\
                    \nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"\
                    AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED\
                    \ TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR\
                    \ PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS\
                    \ BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR\
                    \ CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE\
                    \ GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)\
                    \ HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,\n WHETHER IN CONTRACT,\
                    \ STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING\
                    \ IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY\
                    \ OF SUCH DAMAGE."
        license_expressions:
            - bsd-simplified
            - bsd-simplified
            - bsd-simplified
            - bsd-simplified
        percentage_of_license_text: '94.64'
        scan_errors: []
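
To make the repetition in this output easier to see, here is a small sketch that collapses the four matches above into one entry per SPDX key, with the overall line span and the number of matches. The line numbers are copied from the scan; the function name `summarize` is hypothetical, and this is only an illustration, not the planned "License Detection" grouping.

```python
# Key fields of the four matches reported in the scan above.
matches = [
    {"spdx_license_key": "BSD-2-Clause", "start_line": 5, "end_line": 5},
    {"spdx_license_key": "BSD-2-Clause", "start_line": 5, "end_line": 5},
    {"spdx_license_key": "BSD-2-Clause", "start_line": 5, "end_line": 7},
    {"spdx_license_key": "BSD-2-Clause", "start_line": 7, "end_line": 12},
]


def summarize(matches):
    """Return {spdx_key: (first_line, last_line, match_count)}."""
    summary = {}
    for match in matches:
        key = match["spdx_license_key"]
        start, end = match["start_line"], match["end_line"]
        if key in summary:
            first, last, count = summary[key]
            summary[key] = (min(first, start), max(last, end), count + 1)
        else:
            summary[key] = (start, end, 1)
    return summary


print(summarize(matches))
```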

pombredanne avatar Mar 31 '22 19:03 pombredanne

With respect to reuse: https://github.com/fsfe/reuse-tool/tree/master/src/reuse/resources contains long lists of SPDX licenses. These are real license mentions, but they are false positives because the tool itself is license-related. The latest develop branch has several fixes in this area, with many more planned in #2878, but the problem still shows up in this case.

In general, note that ScanCode is not optimized to scan tools that are themselves license detection tools, so you can expect a lot of matches in these cases.

pombredanne avatar Mar 31 '22 19:03 pombredanne

Hi Philippe,

Sorry if I was not clear.

In the tern.spdx SPDX file, you have the following:

LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause

(see bigger extract in the initial report)

Why do we have the same information 3 times for the same file?

vargenau avatar Apr 01 '22 11:04 vargenau

Hi Philippe,

As you recommended, I have used the --license-text option.

I now get:

LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>licensed under the</text>

LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>see [LICENSE.txt](</text>

So we have the same LicenseID with a different ExtractedText. This seems illegal to me.

The SPDX spec says: Provide a locally unique identifier to refer to licenses that are not found on the SPDX License List.

What do you think?
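
A quick way to check an SPDX tag-value file for this situation is sketched below. It is only an illustration under stated assumptions: the regular expression and the function name `conflicting_license_ids` are hypothetical, and the sketch assumes every LicenseID block contains an ExtractedText field, as in the extract above.

```python
import re
from collections import defaultdict

# A LicenseID line followed (lazily) by the ExtractedText of the same block.
BLOCK = re.compile(
    r"^LicenseID:[ \t]*(?P<id>\S+)[ \t]*$"
    r".*?^ExtractedText:[ \t]*<text>(?P<text>.*?)</text>",
    re.DOTALL | re.MULTILINE,
)


def conflicting_license_ids(spdx_text):
    """Return {LicenseID: sorted texts} for IDs with more than one distinct text."""
    texts = defaultdict(set)
    for match in BLOCK.finditer(spdx_text):
        texts[match.group("id")].add(match.group("text").strip())
    return {lid: sorted(found) for lid, found in texts.items() if len(found) > 1}


sample = """LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>licensed under the</text>

LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>see [LICENSE.txt](</text>
"""

print(conflicting_license_ids(sample))
```

If a block had no ExtractedText, the lazy match would run on into the next block, so a proper SPDX parser is the better choice for real validation.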

vargenau avatar Apr 01 '22 13:04 vargenau

So we have the same LicenseID with a different ExtractedText. This seems illegal to me.

The SPDX spec says: Provide a locally unique identifier to refer to licenses that are not found on the SPDX License List.

The "scancode" part of the LicenseRef is called a license namespace, and it is registered here: https://tools.spdx.org/app/archive_namespace_requests/2/ So the identifier is global, not local.

pombredanne avatar Apr 01 '22 16:04 pombredanne

unknown-license-reference is a special case where ScanCode detects elements of what may be a license. The LicenseID values for unknown-license detections are generated for consistency in the output data - not for use in an SPDX document. There is major rework pending on the handling of unknown-license-reference - see also https://github.com/nexB/scancode-toolkit/issues/2878

mjherzog avatar Apr 01 '22 16:04 mjherzog

@vargenau I am revisiting this as we start some major work on false positives:

  • You are scanning tern and reuse, which are license tools and contain quite a few licenses, so they are not the best example. That said, we should still scan them correctly.
  • There is work on a new feature called "License Detection" that will eventually group multiple license matches into a single detection. This should generally help with some of the issues you brought up here.

@rnjudge see https://raw.githubusercontent.com/tern-tools/tern/cdc6732eda7de1e5e1f9e1298a6db2e073ec48fc/LICENSE.txt which is your damaged and not-really-standard license text and notice. The main issue is mojibake.

pombredanne avatar May 12 '22 08:05 pombredanne

@pombredanne that file comes directly from GitHub when you choose a license for the project. Do you have a suggestion for a more parse-able/standard license text we can use to communicate BSD-2?

Thanks for bringing this to my attention, I wasn't aware. Happy to update!

rnjudge avatar May 12 '22 15:05 rnjudge

@rnjudge you wrote:

that file comes directly from GitHub when you choose a license for the project. Do you have a suggestion for a more parse-able/standard license text we can use to communicate BSD-2?

It may have been this way, but this seems to be no longer the case: https://raw.githubusercontent.com/pombredanne/test-bsd2/main/LICENSE

Any BSD text that scancode detects works! (I will be adding yours as a new rule FWIW.) Using https://scancode-licensedb.aboutcode.org/bsd-simplified.html will surely work perfectly. https://opensource.org/licenses/bsd-license.php and https://spdx.org/licenses/BSD-2-Clause will be fine too.

pombredanne avatar May 12 '22 17:05 pombredanne

Thanks @pombredanne. The license file in Tern was created 5 years ago so it's good you're bringing this up. I opened a PR to fix this in Tern. Could you have a look?

rnjudge avatar May 13 '22 04:05 rnjudge