Update rules with required phrases automatically
This is a continuation of https://github.com/aboutcode-org/scancode-toolkit/pull/3254 with added required phrases in license rules after review and further manual curations. Also contains improvements in required phrase collection and marking.
Reference: #2637 #2878
Tasks
- [x] Reviewed contribution guidelines
- [x] PR is descriptively titled 📑 and links the original issue above 🔗
- [ ] Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR Run tests locally to check for errors.
- [x] Commits are in uniquely-named feature branch and has no merge conflicts 📁
I am pushing shortly a few updates:
- decouple the creation of new rules from updating existing rules in a separate CLI
- ensure we skip more rules in the whole process: any rule that cannot be matched approximately and not only tiny rules, and also false positives
- ensure that no rule get a required phrase addition that would break in the middle of a URL, email, or copyright. This will be done to check that no required phrase injection changes the set of ignorables of a rule and makes the URL not longer a proper URL for instance.
- extend "skipping" the collection of required phrases flag to skip a rule from both required phrases collection AND injection. This allow to handle exceptions more easily.
Hey! This PR looks promising as I think it has the potential to greatly reduce false positives. Just to drop an example, having required phrases throughout the rule set would avoid "silly" mistakes like this:
"rule_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/lgpl-2.1_250.RULE",
"matched_text": "// Licensed under the MIT license. \nnamespace Microsoft.OpenApi.Readers.V2"
And I do see quite a lot of these "ghost" detections.
I was just checking in to ask if there is any hope of seeing this PR merged. It has been open for some time.
@petergardfjall re:
I was j ust checking in to ask if there is any hope of seeing this PR merged. It has been open for some time.
Yes! I am about to have a chat with @AyanSinhaMahapatra on just that.
The longer story is that there are reconciliations to do wrt. the changes I made and changes that @AyanSinhaMahapatra has not yet pushed.
@AyanSinhaMahapatra all green... I am merging now :bow: