scancode-toolkit
Many duplicates in SPDX files
Description
In the SPDX code, we have multiple times the same code, for example:
LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
or
# File
FileName: ./tern.original/LICENSE.txt
SPDXID: SPDXRef-7
FileChecksum: SHA1: 5ec0910f78578a5df32b56cae953249d45d0dd5b
LicenseConcluded: NOASSERTION
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: OFL-1.1
FileCopyrightText: <text>Copyright (c) 2017 VMware, Inc.
</text>
I do not know if it is really a bug, but it is at least confusing.
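To make the repetition easier to spot in a large SPDX tag-value file, here is a minimal Python sketch that counts repeated LicenseInfoInFile values per FileName. The function name `duplicate_license_info` and the simple line-based parsing are assumptions made for illustration; this is not part of ScanCode and it does not handle every tag-value construct.

```python
from collections import Counter, defaultdict

# Sample taken from the tag-value extract in this report.
SAMPLE = """\
# File
FileName: ./tern.original/LICENSE.txt
SPDXID: SPDXRef-7
LicenseConcluded: NOASSERTION
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: OFL-1.1
"""


def duplicate_license_info(spdx_text):
    """Return {file name: {license id: count}} for ids listed more than once."""
    per_file = defaultdict(Counter)
    current_file = None
    for line in spdx_text.splitlines():
        tag, sep, value = line.partition(":")
        if not sep:
            continue  # comment lines and text continuations carry no tag
        tag, value = tag.strip(), value.strip()
        if tag == "FileName":
            current_file = value
        elif tag == "LicenseInfoInFile" and current_file is not None:
            per_file[current_file][value] += 1
    return {
        name: {lic: n for lic, n in counts.items() if n > 1}
        for name, counts in per_file.items()
        if any(n > 1 for n in counts.values())
    }


print(duplicate_license_info(SAMPLE))
# {'./tern.original/LICENSE.txt': {'BSD-2-Clause': 3}}
```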
How To Reproduce
scancode -c -l -i --spdx-tv tern.spdx /home/vargenau/git/tern.original/
scancode -c -l -i --spdx-tv reuse.spdx /home/vargenau/git/reuse/
where the code comes from GitHub:
https://github.com/tern-tools/tern
https://github.com/fsfe/reuse-tool
System configuration
- What OS are you running on? Ubuntu 21.10
- What version of scancode-toolkit was used to generate the scan file? ScanCode version 30.1.0
- What installation method was used to install/run scancode? pip
@vargenau Thank you for the report!
You wrote:
In the SPDX code, we have multiple times the same code, for example: ...
I assume here (maybe incorrectly) that when you say "we have multiple times the same code" you mean that the "See details ..." comment is showing up twice?
It is always best to run with --license-text with an SPDX output to ensure that the extracted text is populated.
Otherwise, you have a boilerplate ExtractedText: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml </text>
I ran a single scan on https://raw.githubusercontent.com/tern-tools/tern/cdc6732eda7de1e5e1f9e1298a6db2e073ec48fc/LICENSE.txt
Of note:
- the text is damaged with mojibake. This can make matching a bit less accurate
$ chardet3 LICENSE.txt
LICENSE.txt: windows-1252 with confidence 0.73
$ file LICENSE.txt
LICENSE.txt: Non-ISO extended-ASCII text, with very long lines, with CRLF line terminators
- the license detection could be better and return only two matches rather than three
- we are working on refinements to eventually merge multiple matches in a single detection
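As an illustration of the mojibake point above, here is a small Python sketch of how bytes such as \x93, \x94 and \x95 become readable once decoded as windows-1252 (cp1252), the encoding reported by chardet3. The helper name `decode_license_bytes` and the fallback order are assumptions for this example, not how ScanCode decodes files.

```python
def decode_license_bytes(raw):
    """Decode bytes as UTF-8, falling back to windows-1252 when that fails."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few bytes undefined, so replace them rather than fail.
        return raw.decode("cp1252", errors="replace")


raw = b"The BSD-2 license (the \x93License\x94) set forth below"
text = decode_license_bytes(raw)

# In cp1252, 0x93 and 0x94 are curly double quotes and 0x95 is a bullet.
assert text == "The BSD-2 license (the \u201cLicense\u201d) set forth below"
assert b"\x95".decode("cp1252") == "\u2022"
```

Saving the license file as UTF-8 after such a decode would avoid the stray control characters that show up in the matched text.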
NB: If you are interested in container scans, check out also the companion server project http://scancode.io/
headers:
- tool_name: scancode-toolkit
tool_version: 31.0.0b1
options:
input:
- LICENSE.txt
--license: yes
--license-text: yes
--license-text-diagnostics: yes
--yaml: '-'
notice: |
Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
OR CONDITIONS OF ANY KIND, either express or implied. No content created from
ScanCode should be considered or used as legal advice. Consult an Attorney
for any legal advice.
ScanCode is a free software code scanning tool from nexB Inc. and others.
Visit https://github.com/nexB/scancode-toolkit/ for support and download.
start_timestamp: '2022-03-31T185132.281914'
end_timestamp: '2022-03-31T185135.025805'
output_format_version: 2.0.0
duration: '2.743912696838379'
message:
errors: []
extra_data:
spdx_license_list_version: '3.16'
files_count: 1
files:
- path: LICENSE.txt
type: file
licenses:
- key: bsd-simplified
score: '100.0'
name: BSD-2-Clause
short_name: BSD-2-Clause
category: Permissive
is_exception: no
is_unknown: no
owner: Regents of the University of California
homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
text_url: http://opensource.org/licenses/bsd-license.php
reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
spdx_license_key: BSD-2-Clause
spdx_url: https://spdx.org/licenses/BSD-2-Clause
start_line: 5
end_line: 5
matched_rule:
identifier: bsd-simplified_226.RULE
license_expression: bsd-simplified
licenses:
- bsd-simplified
referenced_filenames: []
is_license_text: no
is_license_notice: yes
is_license_reference: no
is_license_tag: no
is_license_intro: no
has_unknown: no
matcher: 2-aho
rule_length: 13
matched_length: 13
match_coverage: '100.0'
rule_relevance: 100
matched_text: "The BSD-2 license (the \x93License\x94) set forth below applies\
\ to all parts"
- key: bsd-simplified
score: '50.0'
name: BSD-2-Clause
short_name: BSD-2-Clause
category: Permissive
is_exception: no
is_unknown: no
owner: Regents of the University of California
homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
text_url: http://opensource.org/licenses/bsd-license.php
reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
spdx_license_key: BSD-2-Clause
spdx_url: https://spdx.org/licenses/BSD-2-Clause
start_line: 5
end_line: 5
matched_rule:
identifier: bsd-simplified_275.RULE
license_expression: bsd-simplified
licenses:
- bsd-simplified
referenced_filenames: []
is_license_text: no
is_license_notice: yes
is_license_reference: no
is_license_tag: no
is_license_intro: no
has_unknown: no
matcher: 3-seq
rule_length: 28
matched_length: 14
match_coverage: '50.0'
rule_relevance: 100
matched_text: "License\x94) [set] [forth] [below] [applies] [to] [all] [parts]\
\ [of] [the] [Tern] project. You may not use this file except in compliance\
\ with the License."
- key: bsd-simplified
score: '100.0'
name: BSD-2-Clause
short_name: BSD-2-Clause
category: Permissive
is_exception: no
is_unknown: no
owner: Regents of the University of California
homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
text_url: http://opensource.org/licenses/bsd-license.php
reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
spdx_license_key: BSD-2-Clause
spdx_url: https://spdx.org/licenses/BSD-2-Clause
start_line: 5
end_line: 7
matched_rule:
identifier: bsd-simplified_53.RULE
license_expression: bsd-simplified
licenses:
- bsd-simplified
referenced_filenames: []
is_license_text: no
is_license_notice: no
is_license_reference: yes
is_license_tag: no
is_license_intro: no
has_unknown: no
matcher: 2-aho
rule_length: 3
matched_length: 3
match_coverage: '100.0'
rule_relevance: 100
matched_text: "License. \r\n\r\nBSD-2"
- key: bsd-simplified
score: '100.0'
name: BSD-2-Clause
short_name: BSD-2-Clause
category: Permissive
is_exception: no
is_unknown: no
owner: Regents of the University of California
homepage_url: http://www.opensource.org/licenses/BSD-2-Clause
text_url: http://opensource.org/licenses/bsd-license.php
reference_url: https://scancode-licensedb.aboutcode.org/bsd-simplified
scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.LICENSE
scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/bsd-simplified.yml
spdx_license_key: BSD-2-Clause
spdx_url: https://spdx.org/licenses/BSD-2-Clause
start_line: 7
end_line: 12
matched_rule:
identifier: bsd-simplified_169.RULE
license_expression: bsd-simplified
licenses:
- bsd-simplified
referenced_filenames: []
is_license_text: yes
is_license_notice: no
is_license_reference: no
is_license_tag: no
is_license_intro: no
has_unknown: no
matcher: 2-aho
rule_length: 184
matched_length: 184
match_coverage: '100.0'
rule_relevance: 100
matched_text: "License \r\n\r\nRedistribution and use in source and binary forms,\
\ with or without modification, are permitted provided that the following\
\ conditions are met:\r\n\x95\tRedistributions of source code must retain\
\ the above copyright notice, this list of conditions and the following\
\ disclaimer.\r\n\x95\tRedistributions in binary form must reproduce the\
\ above copyright notice, this list of conditions and the following disclaimer\
\ in the documentation and/or other materials provided with the distribution.\r\
\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"\
AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED\
\ TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR\
\ PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS\
\ BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR\
\ CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE\
\ GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)\
\ HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,\n WHETHER IN CONTRACT,\
\ STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING\
\ IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY\
\ OF SUCH DAMAGE."
license_expressions:
- bsd-simplified
- bsd-simplified
- bsd-simplified
- bsd-simplified
percentage_of_license_text: '94.64'
scan_errors: []
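The four matches in the output above all carry the same license key, which is why the same expression is listed four times. As a rough illustration of what grouping them could look like, here is a Python sketch that collapses matches per key into one line span. The function `summarize_matches` is hypothetical and is not ScanCode's "License Detection" logic; it only uses the key, start_line and end_line fields shown in the scan output.

```python
def summarize_matches(matches):
    """Collapse license matches into one entry per key with the overall line span."""
    summary = {}
    for match in matches:
        entry = summary.setdefault(
            match["key"],
            {
                "start_line": match["start_line"],
                "end_line": match["end_line"],
                "match_count": 0,
            },
        )
        entry["start_line"] = min(entry["start_line"], match["start_line"])
        entry["end_line"] = max(entry["end_line"], match["end_line"])
        entry["match_count"] += 1
    return summary


# Keys and line ranges copied from the four matches in the scan above.
matches = [
    {"key": "bsd-simplified", "start_line": 5, "end_line": 5},
    {"key": "bsd-simplified", "start_line": 5, "end_line": 5},
    {"key": "bsd-simplified", "start_line": 5, "end_line": 7},
    {"key": "bsd-simplified", "start_line": 7, "end_line": 12},
]

print(summarize_matches(matches))
# {'bsd-simplified': {'start_line': 5, 'end_line': 12, 'match_count': 4}}
```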
With respect to reuse: https://github.com/fsfe/reuse-tool/tree/master/src/reuse/resources contains long lists of SPDX licenses that are real license mentions but false positives, since this is a tool that is license-related. The latest develop branch has several fixes in this area and many more planned in #2878, but this is still showing up in this case.
In general, note that ScanCode is not optimized to scan tools that are themselves license detection tools, so you can expect a lot of matches in these cases.
@vargenau Thank you for the report!
You wrote:
In the SPDX code, we have multiple times the same code, for example: ...
I assume here (maybe incorrectly) that when you say "we have multiple times the same code" you mean that the "See details ..." comment is showing up twice?
It is always best to run with --license-text with an SPDX output to ensure that the extracted text is populated. Otherwise, you have a boilerplate ExtractedText: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml </text>
Hi Philippe,
Sorry if I was not clear.
In the tern.spdx SPDX file, you have the following:
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
LicenseInfoInFile: BSD-2-Clause
(see bigger extract in the initial report)
Why do we have the same information 3 times for the same file?
@vargenau Thank you for the report!
You wrote:
In the SPDX code, we have multiple times the same code, for example: ...
I assume here (maybe incorrectly) that when you say "we have multiple times the same code" you mean that the "See details ..." comment is showing up twice?
It is always best to run with --license-text with an SPDX output to ensure that the extracted text is populated. Otherwise, you have a boilerplate ExtractedText: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml </text>
Hi Philippe,
As you recommended, I have used the --license-text option.
I now get:
LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>licensed under the</text>
LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
LicenseComment: <text>See details at https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
</text>
ExtractedText: <text>see [LICENSE.txt](</text>
So we have the same LicenseID with a different ExtractedText. This seems illegal to me.
The SPDX spec says: Provide a locally unique identifier to refer to licenses that are not found on the SPDX License List.
What do you think?
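To check a whole file for this pattern, here is a minimal Python sketch that lists each LicenseID appearing with more than one distinct ExtractedText. The function name `conflicting_extracted_texts` and the regex-based parsing are assumptions for illustration; the sketch only surfaces the pattern and takes no position on whether it is valid SPDX.

```python
import re
from collections import defaultdict

# Either a LicenseID line or an ExtractedText <text>...</text> field,
# which may span several lines.
FIELD = re.compile(
    r"^LicenseID:\s*(\S+)|^ExtractedText:\s*<text>(.*?)</text>",
    flags=re.MULTILINE | re.DOTALL,
)


def conflicting_extracted_texts(spdx_text):
    """Return {LicenseID: sorted texts} for ids with several distinct ExtractedText values."""
    texts = defaultdict(set)
    current_id = None
    for match in FIELD.finditer(spdx_text):
        if match.group(1) is not None:
            current_id = match.group(1)
        elif current_id is not None:
            texts[current_id].add(match.group(2).strip())
    return {lid: sorted(found) for lid, found in texts.items() if len(found) > 1}


SAMPLE = """\
LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
ExtractedText: <text>licensed under the</text>
LicenseID: LicenseRef-scancode-unknown-license-reference
LicenseName: Unknown License file reference
ExtractedText: <text>see [LICENSE.txt](</text>
"""

print(conflicting_extracted_texts(SAMPLE))
```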
So we have the same LicenseID with a different ExtractedText. This seems illegal to me.
The SPDX spec says: Provide a locally unique identifier to refer to licenses that are not found on the SPDX License List.
The "scancode" part used in the LicenseRef is called a license namespace, and it is registered here: https://tools.spdx.org/app/archive_namespace_requests/2/
So this is global and not local.
unknown-license-reference is a special case where ScanCode detects elements of what may be a license. The LicenseID values for unknown-license detections are generated for consistency in the output data - not for use in an SPDX document. There is major rework pending on the handling of unknown-license-reference - see also https://github.com/nexB/scancode-toolkit/issues/2878
@vargenau I am revisiting this as we start some major work on false positives:
- You are scanning tern and reuse, and they have quite a few licenses in them, making them not the best examples as they are license tools. That said, we should still scan them correctly
- There is work on a new feature called "License Detection" that will eventually group multiple license matches into a single detection, which should generally help cope with some of the issues you brought up here
@rnjudge see https://raw.githubusercontent.com/tern-tools/tern/cdc6732eda7de1e5e1f9e1298a6db2e073ec48fc/LICENSE.txt which is your damaged and not-really-standard license text and notice. The main issue is mojibake
@pombredanne that file comes directly from GitHub when you choose a license for the project. Do you have a suggestion for a more parse-able/standard license text we can use to communicate BSD-2?
Thanks for bringing this to my attention, I wasn't aware. Happy to update!
@rnjudge you wrote:
that file comes directly from GitHub when you choose a license for the project. Do you have a suggestion for a more parse-able/standard license text we can use to communicate BSD-2?
It may have been this way, but this seems to be no longer the case: https://raw.githubusercontent.com/pombredanne/test-bsd2/main/LICENSE
Any BSD text that ScanCode detects works! (I will be adding yours as a new rule, FWIW.) Using https://scancode-licensedb.aboutcode.org/bsd-simplified.html will surely work perfectly. https://opensource.org/licenses/bsd-license.php and https://spdx.org/licenses/BSD-2-Clause will be fine too.
Thanks @pombredanne. The license file in Tern was created 5 years ago so it's good you're bringing this up. I opened a PR to fix this in Tern. Could you have a look?