
Improve license detection for wrong SPDX license identifiers

Open AyanSinhaMahapatra opened this issue 1 year ago • 3 comments

Consider the following text:

SPDX-License-Identifier: (GPL-2.0+ OR BSD)

Here BSD is not a valid SPDX license identifier, and even adding a rule is insufficient because the SPDX-License-Identifier based detection was moved to run before the hash license detection.

We should either:

  1. Do the hash license detection first so we can catch these with rules, and then do the SPDX identifier based detection.
  2. If we get unknown-spdx, fall back to license detection with rules.
  3. Optionally, if nothing else works, also consider license detection with required phrase rules (though this would potentially lose the license expression info).
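
Option 2 could look roughly like the sketch below. This is a toy illustration only, not scancode-toolkit code: `KNOWN_SPDX`, `RULES`, `detect_spdx` and `detect` are hypothetical names, and mapping BSD to bsd-new is an assumption made for the example.

```python
# Toy sketch of option 2: fall back to rules when SPDX detection yields unknown-spdx.
# All names here are hypothetical stand-ins, not scancode-toolkit APIs.

# Tiny stand-in for the SPDX identifier -> license key mapping.
KNOWN_SPDX = {"GPL-2.0+": "gpl-2.0-plus", "BSD-3-Clause": "bsd-new"}

# Tiny stand-in for exact-text rules: full text -> license expression.
RULES = {
    "SPDX-License-Identifier: (GPL-2.0+ OR BSD)": "gpl-2.0-plus OR bsd-new",
}


def detect_spdx(text):
    """Map each identifier of a simple 'A OR B' expression to a license key."""
    expression = text.partition(":")[2].strip().strip("()")
    keys = [KNOWN_SPDX.get(part.strip(), "unknown-spdx")
            for part in expression.split(" OR ")]
    return " OR ".join(keys)


def detect(text):
    """Try SPDX detection first; use a rule if any identifier is unknown."""
    expression = detect_spdx(text)
    if "unknown-spdx" in expression.split():
        return RULES.get(text.strip(), expression)
    return expression


line = "SPDX-License-Identifier: (GPL-2.0+ OR BSD)"
spdx_only = detect_spdx(line)
with_fallback = detect(line)
```

With this toy data, SPDX-only detection leaves an unknown-spdx in the expression, while the fallback returns the expression carried by the matching rule.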

AyanSinhaMahapatra avatar Sep 09 '24 11:09 AyanSinhaMahapatra

Create a rule for gpl-2.0-plus AND bsd-new with this text:

SPDX-License-Identifier: (GPL-2.0+ OR BSD)

and give it a relevance of 99.

That is the approach for BSDs: the rule will be picked over the SPDX detection, or at least it should be.

pombredanne avatar Sep 12 '24 12:09 pombredanne

Here are examples https://github.com/search?q="SPDX-License-Identifier%3A+(GPL-2.0%2B+OR+BSD)"&type=code and

https://github.com/BPI-SINOVOIP/BPI-R2PRO-BSP/blob/938b4b14d8ee8e332a6cf04111a11d9a95156a6d/kernel/include/dt-bindings/reset/amlogic%2Cmeson-axg-reset.h#L9

pombredanne avatar Sep 12 '24 18:09 pombredanne

I pushed a fix in https://github.com/aboutcode-org/scancode-toolkit/pull/3905/commits/c581828c12c5b692f9b0c080f4da07b9e014285f

The default sort order of LicenseMatch was based on the "matcher" string, so a "1-spdx-id" match would always beat a "2-aho" match. There is now a new "matcher_order" integer attribute that is used for sorting instead, so the hash and aho matches always take precedence over SPDX.
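
To illustrate the difference, here is a minimal standalone sketch. It is not the actual LicenseMatch class: the `Match` dataclass and the integer values are made up for the example, and only the "matcher" strings and the "matcher_order" attribute name come from the change described above.

```python
from dataclasses import dataclass


@dataclass
class Match:
    # Hypothetical stand-in for a license match; not the real LicenseMatch.
    matcher: str        # matcher name, e.g. "1-spdx-id" or "2-aho"
    matcher_order: int  # illustrative values only; lower sorts first
    expression: str


spdx = Match("1-spdx-id", matcher_order=2, expression="gpl-2.0-plus OR unknown-spdx")
aho = Match("2-aho", matcher_order=1, expression="gpl-2.0-plus OR bsd-new")

# Old behavior: sorting on the matcher string puts "1-spdx-id" first.
by_string = sorted([aho, spdx], key=lambda m: m.matcher)

# New behavior: sorting on the integer order puts the aho (rule) match first.
by_order = sorted([aho, spdx], key=lambda m: m.matcher_order)

print(by_string[0].matcher)
print(by_order[0].matcher)
```

Decoupling the sort key from the display name means precedence can be changed without renaming matchers.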

pombredanne avatar Sep 12 '24 21:09 pombredanne