scancode-toolkit
Do not double scan a file both as a package and as a plain file
Scanning the same file twice leads to confusing reporting. We should scan for packages first and skip scanning a file as a plain file when its package manifest has been scanned properly.
However, we need to make sure that we do not skip or ignore rare cases, such as some Maven POMs that put their license notice in XML comments instead of using structured license fields.
https://repo1.maven.org/maven2/org/glassfish/javax.json/1.1.4/javax.json-1.1.4.pom
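To make the corner case concrete, here is a minimal sketch of why a structure-only reader would miss such a notice. The POM text is a made-up sample, not the content of the file linked above, and the helper names `structured_license_names` and `comment_texts` are hypothetical; this is not how ScanCode parses POMs.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: the license notice lives only in an XML comment,
# and there is no <licenses> section at all.
SAMPLE_POM = """<!--
    Licensed under the Eclipse Public License v2.0.
-->
<project>
  <artifactId>demo</artifactId>
</project>
"""


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def structured_license_names(pom_text):
    """Collect <license><name> values, the way a manifest-aware reader would."""
    root = ET.fromstring(pom_text)  # the default parser discards comments
    names = []
    for element in root.iter():
        if local_name(element.tag) == "license":
            for child in element:
                if local_name(child.tag) == "name" and child.text:
                    names.append(child.text.strip())
    return names


def comment_texts(pom_text):
    """Collect the raw text of XML comments, which only a plain-text scan sees."""
    return [c.strip() for c in re.findall(r"<!--(.*?)-->", pom_text, flags=re.DOTALL)]
```

With this sample, the structured reader finds no license names, while the comment text still carries the notice. That is the information a plain file scan must not lose.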
Quoting from #3211:
In general, the license of a package manifest is best collected with --package, which knows about the manifest structure.
I'd even argue that, without --package, files that would have been recognized with --package should not be scanned at all.
This would mean that, without --package, files could be skipped silently. There are too many weird corner cases for this to work reliably at scale, IMHO. And even if I were to agree, it would require running the package scan and then discarding the perfectly correct data that was collected, which would be a double whammy: a slower scan and incomplete, incorrect detection at the same time. Neither seems like a tasty outcome to me :] .
This would mean that, without --package, files could be skipped silently.
Yes, we don't want to do this, as there are many comments and other clues outside declared license statements that might be valuable. We should instead reconcile or override the results depending on whether the package detection and the file detection match.
One requirement is to track line numbers correctly for package license detections. With those, we can tell that a package license detection and a file license detection come from the same location in the same file, and then override the file detection with the package license detection when both are for the same text. This is not possible today because package license detection only has the context of the extracted_license_statement string.
This is tracked in https://github.com/aboutcode-org/scancode-toolkit/issues/3385
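The reconciliation described above could look roughly like the following sketch. It assumes that package license detections already carry line ranges, which is exactly what is missing today. The `Detection` class, its field names, and the `reconcile` function are hypothetical and are not ScanCode's data model. Line overlap is used here as a simple stand-in for "same text".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical, simplified license detection record."""
    path: str
    start_line: int
    end_line: int
    license_expression: str
    source: str  # "package" or "file"


def overlaps(a, b):
    """True when both detections cover intersecting lines of the same file."""
    return (
        a.path == b.path
        and a.start_line <= b.end_line
        and b.start_line <= a.end_line
    )


def reconcile(package_detections, file_detections):
    """Keep every package detection, plus the file detections that do not
    overlap any package detection (comments, notices and other clues)."""
    kept = list(package_detections)
    for file_detection in file_detections:
        if not any(overlaps(file_detection, pd) for pd in package_detections):
            kept.append(file_detection)
    return kept
```

For example, a file detection on the same lines as a declared license would be dropped in favor of the package detection, while a notice in a header comment elsewhere in the file would be kept:

```python
declared = Detection("pom.xml", 10, 14, "epl-2.0", "package")
same_text = Detection("pom.xml", 10, 14, "epl-2.0", "file")
header_notice = Detection("pom.xml", 1, 5, "gpl-2.0", "file")

result = reconcile([declared], [same_text, header_notice])
# result holds `declared` and `header_notice`, not `same_text`
```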