scancode-toolkit

Failed to scan some files for licenses

Open Kannan-3757 opened this issue 4 months ago • 9 comments

One particular file with a .go extension that contains license details is not being scanned. I have attached the result JSON for reference, along with the path of the file that was missed.

Package url: https://github.com/etcd-io/etcd/archive/refs/tags/v3.6.4.zip

File path : etcd-3.6.4/i-am-an-other-dual-licensed-file.go

We are using ScanCode on a Linux machine, and we download the source from Git every time.

Affected ScanCode version: 32.3.3

Affected ScanCode Output Format version: 4.0.0

etcd-io_2025_08_21_163948.json

Kannan-3757 avatar Aug 21 '25 12:08 Kannan-3757

I cannot find that particular file in the archive that was linked.

armijnhemel avatar Aug 21 '25 15:08 armijnhemel

Sorry for the inconvenience. Here is the modified version that was used for scanning.

etcd-with-dual-license-3.6.4 (1).zip

Kannan-3757 avatar Aug 22 '25 04:08 Kannan-3757

Is it correct that these files are not part of the actual package, but they were created separately for testing scancode?

armijnhemel avatar Aug 27 '25 18:08 armijnhemel

Yes, we are manually adding some files and scanning them.

Kannan-3757 avatar Sep 01 '25 11:09 Kannan-3757

> Yes, we are manually adding some files and scanning them.

This doesn't make sense to me. @pombredanne

armijnhemel avatar Sep 02 '25 12:09 armijnhemel

We are testing our tooling and modifying packages to trigger certain scenarios. This case is about detecting dual licensed files.

CsatariGergely avatar Sep 02 '25 12:09 CsatariGergely

> We are testing our tooling and modifying packages to trigger certain scenarios. This case is about detecting dual licensed files.

I can understand the motivation, but I am going to push back a little bit (and @pombredanne can tell me if I am wrong). As far as I can see, these license texts are not found in actual packages or even in use anywhere (except for one that I flagged earlier); they were created specifically to see if scancode would detect these (non-existent) licenses. Because they weren't detected, extra rules would need to be added to scancode. Adding these rules would potentially impose extra costs on every user of scancode (extra memory, extra run time). Is that really worth it?

I mean, we could probably also ask some AI system to create all kinds of variations of license texts and headers and then create rules, even if these licenses are never ever used in real life (except in scancode rules). To me it doesn't make sense to try and cover every possible use case, especially if the licenses are not used (but in the end it is not up to me to decide that :) ).

armijnhemel avatar Sep 02 '25 13:09 armijnhemel

Our policy for adding a license to ScanCode LicenseDB and adding the corresponding detection rules is that the license is in actual use. This is for several reasons, especially the point that Armijn made about scanning overhead.

In the case of a license not in the LicenseDB, ScanCode should usually return one of the unknown license detections: unknown, free-unknown, unknown-license-reference or unknown-spdx. The purpose is to detect text that looks like a license, but is not registered as a ScanCode license.
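As a rough way to check a result file against the behaviour described above, the sketch below sorts each scanned file into "none", "unknown", or "known". It assumes the JSON has a top-level `files` list whose entries carry `path`, `type`, and a `license_detections` list with a `license_expression` string. That layout is an assumption here and should be verified against the attached result JSON. The function name `summarize` and the helper `is_unknown` are made up for illustration.

```python
import re

# The four "unknown" license keys named in the comment above.
UNKNOWN_KEYS = {
    "unknown",
    "free-unknown",
    "unknown-license-reference",
    "unknown-spdx",
}


def is_unknown(expression: str) -> bool:
    """Return True if any license key in the expression is an unknown key."""
    # Split on whitespace and parentheses so "a AND (b OR c)" yields a, AND, b, OR, c.
    tokens = re.split(r"[\s()]+", expression.lower())
    return any(token in UNKNOWN_KEYS for token in tokens)


def summarize(scan: dict) -> dict:
    """Map each file path to 'none', 'unknown', or 'known'.

    ASSUMPTION: the scan uses files[].license_detections[].license_expression.
    """
    summary = {}
    for entry in scan.get("files", []):
        if entry.get("type") != "file":
            continue  # skip directories
        expressions = [
            detection.get("license_expression") or ""
            for detection in entry.get("license_detections") or []
        ]
        if not expressions:
            summary[entry["path"]] = "none"
        elif any(is_unknown(expr) for expr in expressions):
            summary[entry["path"]] = "unknown"
        else:
            summary[entry["path"]] = "known"
    return summary


# Tiny in-memory stand-in for a real result file (hypothetical paths).
sample = {
    "files": [
        {"path": "pkg", "type": "directory", "license_detections": []},
        {"path": "pkg/a.go", "type": "file", "license_detections": []},
        {"path": "pkg/b.go", "type": "file",
         "license_detections": [{"license_expression": "apache-2.0 OR gpl-2.0"}]},
        {"path": "pkg/c.go", "type": "file",
         "license_detections": [{"license_expression": "unknown-license-reference"}]},
    ]
}

for path, status in summarize(sample).items():
    print(f"{status:8} {path}")
```

For a real scan, the dictionary would come from `json.load` on the result file instead of the `sample` literal. A file reported as "none" was scanned, but nothing in it matched a license rule.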

mjherzog avatar Sep 02 '25 17:09 mjherzog

Understood. The file was scanned, but ScanCode did not detect any of the licenses from the text "This file is licensed under the Apache License 2.0 OR GNU General Public License version 2 license" because it is an artificial license declaration by me.

For me it is okay to close this issue.

CsatariGergely avatar Sep 04 '25 12:09 CsatariGergely