Catch common pattern errors with `\b`
Description:
The intention of this change is to detect common pattern errors (specifically when using \b).
For example, given the problematic pattern glpat-[\w-]{20}\b, the following error will be produced:
$ go generate ./...
{"level":"fatal","error":"regex contains word boundary (\\b) beside non-word character (i.e., not [A-Za-z0-9_]): glpat-[\\w-]{20}\\b","rule-id":"gitlab-pat","time":"2024-09-09T13:34:24-04:00","message":"Failed to validate rule"}
exit status 1
cmd/generate/config/main.go:17: running "go": exit status 1
I've discussed why this is an issue in a few different places, most recently: https://github.com/leaktk/patterns/pull/41#discussion_r1723758978
Open Questions
-
[ ] Should something permission like
(?:.|\s){0,200}count as a non-word character? I am leaning towards yes because it's a lack of specificity, but am stuck.The Kubernetes pattern fails if "anything" is considered a non-word character. https://github.com/gitleaks/gitleaks/blob/e93a7c0d2604fd1bcc43ac9cac6144a62387a8a4/cmd/generate/config/rules/kubernetes.go#L48
-
[ ] Should this fail or simply log a warning?
-
[ ] Check for two word characters together (
a\bc) which is also invalid -
[ ] Always populate before/after for better debugging
Checklist:
- [x] Does your PR pass tests?
- [x] Have you written new tests for your changes?
- [x] Have you lint your code locally prior to submission?
This PR isn't ready to merge but does need some feedback.
Should this fail or simply log a warning?
I think this is safe to fail since it only gets called during the generation of the config and not during scans.
Should something permission like (?:.|\s){0,200} count as a non-word character? I am leaning towards yes because it's a lack of specificity, but am stuck.
That's a tough one, I'm not sure either.
Overall I like this change since it enhances already existing guardrails during the config generation.
@bplaxco Some interesting results after testing against leaktk/patterns:
8:45AM FTL Failed to load config error="regex contains word boundary (\\b) '' non-word character (i.e., not [A-Za-z0-9_]): before=-, after=, segment=(?i:\\bXOXB-[0-9]{10,13}-[0-9]{10,13}[\\-0-9A-Za-zſK]*\\b)"
8:46AM FTL Failed to load config error="regex contains word boundary (\\b) '' non-word character (i.e., not [A-Za-z0-9_]): before=-, after=, segment=\\bglpat-[\\-0-9A-Z_a-z]{20}\\b"
8:47AM FTL Failed to load config error="regex contains word boundary (\\b) '' non-word character (i.e., not [A-Za-z0-9_]): before=-, after=, segment=\\b(?:glrt-|GR1348941)[\\-0-9A-Z_a-z]{20}\\b"
8:47AM FTL Failed to load config error="regex contains word boundary (\\b) '' non-word character (i.e., not [A-Za-z0-9_]): before=-, after=, segment=(?i:\\bXOX[EPep](?:-[0-9]{10,13}){3}-[\\-0-9A-Za-zſK]{28,34}\\b)"
8:48AM FTL Failed to load config error="regex contains word boundary (\\b) '' non-word character (i.e., not [A-Za-z0-9_]): before=\x00-\b, after=, segment=(?-s:\\boc[\\t\\n\\f\\r ]+login[\\t\\n\\f\\r ]+.*?--token[\\t\\n\\f\\r ]*=?[\\t\\n\\f\\r ]*(sha256~[^\\t\\n\\f\\r <]{16,})\\b)"
8:50AM FTL Failed to load config error="regex contains word boundary (\\b) '' non-word character (i.e., not [A-Za-z0-9_]): before=-, after=, segment=\\bSG\\.[\\-0-9A-Z_a-z]{16,32}\\.[\\-0-9A-Z_a-z]{16,64}\\b"
8:50AM FTL Failed to load config error="regex contains word boundary (\\b) '' non-word character (i.e., not [A-Za-z0-9_]): before=, after=, segment=(?i:\\bXOXE-[0-9]-[0-9A-Za-zſK]{146}\\bJ)"
8:58AM FTL Failed to load config error="regex contains word boundary (\\b) '' non-word character (i.e., not [A-Za-z0-9_]): before=+, after=, segment=(?i:AWS[\\-0-9A-Z_a-zſK]{0,32}[\"']?[\\t\\n\\f\\r ]*?[\\(:=][\\t\\n\\f\\r ]*?[\"']?([\\+/-9A-Za-zſK]{40})\\b)"
(The output could be a lot clearer/more friendly. Basically, /b is in an invalid position, potentially missing results.)
@bplaxco Some interesting results after testing against leaktk/patterns:
Thanks for pointing those out :ok_hand: I'll update the patterns and my pattern compiler to check for that ^_^