scancode-toolkit
Improve license detection for wrong SPDX license identifiers
Consider the following text:
SPDX-License-Identifier: (GPL-2.0+ OR BSD)
Here BSD is not a valid SPDX license identifier, and even adding a rule is insufficient because the SPDX-License-Identifier based detection was moved before the hash license detection.
We should either:
- do the hash license detection first so we can catch these with rules, and then do the SPDX identifier based detection
- if we get unknown-spdx we consider license detection with rules
- optionally, if nothing else works, also consider license detection with required phrase rules (though this could lose the license expression info)
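As a rough illustration of the second option (fall back to rules when the SPDX pass yields unknown-spdx), here is a minimal sketch. It does not use ScanCode's actual API: `KNOWN_SPDX`, `RULES`, `detect_spdx`, `detect_with_rules` and `detect` are all hypothetical names, and the parsing is deliberately naive.

```python
# Hypothetical, simplified stand-ins for the SPDX key mapping and the rule index.
KNOWN_SPDX = {"GPL-2.0+": "gpl-2.0-plus", "BSD-3-Clause": "bsd-new"}
RULES = {
    "SPDX-License-Identifier: (GPL-2.0+ OR BSD)": "gpl-2.0-plus OR bsd-new",
}

PREFIX = "SPDX-License-Identifier:"
OPERATORS = ("OR", "AND", "WITH")


def detect_spdx(line):
    """Naive SPDX pass: map each identifier, or use unknown-spdx if unrecognized."""
    line = line.strip()
    if not line.startswith(PREFIX):
        return None
    expression = line[len(PREFIX):].strip().strip("()")
    parts = []
    for token in expression.split():
        if token in OPERATORS:
            parts.append(token)
        else:
            parts.append(KNOWN_SPDX.get(token, "unknown-spdx"))
    return " ".join(parts)


def detect_with_rules(line):
    """Naive rule pass: exact match on the whole line (assumed rule text)."""
    return RULES.get(line.strip())


def detect(line):
    """Prefer the SPDX result, but retry with rules when it is unknown."""
    spdx = detect_spdx(line)
    if spdx is None or "unknown-spdx" in spdx:
        # Keep the SPDX result if no rule matches either.
        return detect_with_rules(line) or spdx
    return spdx


result = detect("SPDX-License-Identifier: (GPL-2.0+ OR BSD)")
```

With this fallback, a valid identifier still goes through the SPDX pass unchanged, while the `(GPL-2.0+ OR BSD)` line is only resolved if a matching rule exists.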
create a rule for gpl-2.0-plus AND bsd-new with this text
SPDX-License-Identifier: (GPL-2.0+ OR BSD)
and set its relevance to 99.
That's the approach for BSDs: the rule will be picked over the SPDX detection, or at least it should be.
Here are examples https://github.com/search?q="SPDX-License-Identifier%3A+(GPL-2.0%2B+OR+BSD)"&type=code and
https://github.com/BPI-SINOVOIP/BPI-R2PRO-BSP/blob/938b4b14d8ee8e332a6cf04111a11d9a95156a6d/kernel/include/dt-bindings/reset/amlogic%2Cmeson-axg-reset.h#L9
I pushed a fix in https://github.com/aboutcode-org/scancode-toolkit/pull/3905/commits/c581828c12c5b692f9b0c080f4da07b9e014285f
The default sort order of LicenseMatch was based on the "matcher" string, hence a "1-spdx-id" match would always beat a "2-aho" match. Now we have a new "matcher_order" integer attribute that is used to sort instead, so the hash and aho matches always take precedence over SPDX.
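To show why sorting on the string favored SPDX, here is a small standalone sketch. The `Match` class is a hypothetical stand-in for LicenseMatch, and both the integer values and the license expressions are assumptions chosen for illustration, not taken from the commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for LicenseMatch with only the fields needed here."""
    expression: str
    matcher: str
    matcher_order: int  # assumed values; lower sorts first


matches = [
    Match("gpl-2.0-plus OR unknown-spdx", "1-spdx-id", 2),
    Match("gpl-2.0-plus OR bsd-new", "2-aho", 1),
]

# Old behavior: sort on the matcher string, so "1-spdx-id" precedes "2-aho".
by_string = sorted(matches, key=lambda m: m.matcher)

# New behavior: sort on the integer, so the aho match precedes the SPDX match.
by_order = sorted(matches, key=lambda m: m.matcher_order)

print(by_string[0].matcher)
print(by_order[0].matcher)
```

Decoupling the sort key from the display string means the precedence can be changed without renaming the matchers.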