Spdx-Java-Library
Use Unicode Properties in regex normalization license expressions
From the discussion on the implementers call on 29 Oct., we could use Unicode properties in the regular expressions to simplify, and possibly speed up, the license matching algorithms.
Reference: https://en.wikipedia.org/wiki/Unicode_character_property
Also from the call, note that
- a regular expression character class is not the same thing as a Unicode character class (property), so the regex implementation needs to take the differences into account
- each major JDK version is tied to a specific Unicode version (Java 11: Unicode 10, Java 17: Unicode 13, Java 21: Unicode 15). New Unicode versions add characters and assign them to character classes, so the same regex may match a string on one JVM version but not on another
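To make the JVM-version point concrete, here is a minimal sketch that checks whether a code point matches a Unicode general category on the running JVM. The class and method names are made up for illustration. The probe character U+1FAE8 was, to my knowledge, added in Unicode 15 with category So, so the result should differ between JVMs bundling older and newer Unicode data.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spdx-Java-Library.
class UnicodeVersionProbe {
    // True if the single code point matches \p{<category>} on the running JVM.
    static boolean matchesCategory(String category, int codePoint) {
        String input = new String(Character.toChars(codePoint));
        return Pattern.compile("\\p{" + category + "}").matcher(input).matches();
    }

    public static void main(String[] args) {
        int probe = 0x1FAE8; // assumed: assigned in Unicode 15, category So
        // The answer depends on the Unicode tables shipped with this JVM.
        System.out.println("Java " + System.getProperty("java.version")
                + ": \\p{So} matches U+1FAE8? " + matchesCategory("So", probe));
    }
}
```

Running this on each JVM version the library supports is a cheap way to see whether a given pattern is sensitive to the bundled Unicode version.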
Specifically for Java, the Unicode Support section of the java.util.regex.Pattern JavaDoc is very handy in finding out how to match Unicode input text.
In my experience, enabling "Unicode mode" (the (?U) inline flag or UNICODE_CHARACTER_CLASS programmatic flag), then using POSIX character classes can cause confusion. For example, these two regexes give different results for the input text "+":
[\p{Punct}]
(?U:[\p{Punct}])
(the first regex matches, the second one doesn't, because the character + is not considered punctuation in Unicode - instead it's categorised as a math symbol)
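A small sketch of that difference, using the programmatic UNICODE_CHARACTER_CLASS flag rather than the inline (?U) form. The class and method names are illustrative only.

```java
import java.util.regex.Pattern;

// Illustrative only: compares \p{Punct} with and without Unicode mode.
class PunctModes {
    // Default mode: \p{Punct} is the POSIX (ASCII) punctuation class.
    static boolean posixPunct(String s) {
        return Pattern.compile("[\\p{Punct}]").matcher(s).matches();
    }

    // Unicode mode: \p{Punct} follows the Unicode Punctuation property.
    static boolean unicodePunct(String s) {
        return Pattern.compile("[\\p{Punct}]", Pattern.UNICODE_CHARACTER_CLASS)
                .matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(posixPunct("+"));        // true
        System.out.println(unicodePunct("+"));      // false ("+" is a math symbol)
        System.out.println(posixPunct("\u00AB"));   // false (not ASCII)
        System.out.println(unicodePunct("\u00AB")); // true (initial quote punctuation)
    }
}
```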
For that reason I tend to leave Unicode mode disabled, and instead build composite character classes using the explicit Unicode category support in the JVM's regex engine:
[\p{Punct}\p{IsPunctuation}] # This matches both ASCII punctuation (which includes "+") and Unicode punctuation (which does not include "+", but does include many other characters that are not ASCII punctuation)
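As a quick check of the composite class, the sketch below compiles it with Unicode mode left off and tries a few inputs. The class name and the sample characters are my own choices for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: composite of ASCII punctuation and Unicode punctuation.
class CompositePunct {
    // No UNICODE_CHARACTER_CLASS flag, so \p{Punct} stays ASCII-only
    // while \p{IsPunctuation} adds the Unicode Punctuation property.
    static final Pattern PUNCT = Pattern.compile("[\\p{Punct}\\p{IsPunctuation}]");

    static boolean isPunct(String s) {
        return PUNCT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPunct("+"));      // true, via \p{Punct}
        System.out.println(isPunct("\u00AB")); // true, via \p{IsPunctuation}
        System.out.println(isPunct("\u00D7")); // false: U+00D7 is a non-ASCII math symbol
        System.out.println(isPunct("a"));      // false
    }
}
```

Note that non-ASCII symbols such as U+00D7 fall outside both halves, so a symbol category would need to be added explicitly if those should match too.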
Regardless of which method is chosen, for comprehensibility I'd encourage regex-heavy code like Spdx-Java-Library to pick one method (either Unicode character classes enabled, or not), then use it consistently everywhere. In my experience mixing and matching whether Unicode support is enabled or not is basically guaranteed to result in difficult-to-find bugs, since it changes the semantics of regexes in non-obvious ways.
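One way to enforce a single convention might be to route every pattern through one small factory. This is only a sketch: `RegexFactory` is a hypothetical name, not an existing class in Spdx-Java-Library, and it assumes the "Unicode mode off" convention was the one chosen.

```java
import java.util.regex.Pattern;

// Hypothetical factory: the single place where patterns get compiled.
final class RegexFactory {
    // Assumed project-wide convention: Unicode character class mode is OFF.
    private static final int BASE_FLAGS = 0;

    private RegexFactory() {}

    static Pattern compile(String regex) {
        return Pattern.compile(regex, BASE_FLAGS);
    }

    static Pattern compile(String regex, int extraFlags) {
        // Reject attempts to switch Unicode mode on for a single pattern.
        if ((extraFlags & Pattern.UNICODE_CHARACTER_CLASS) != 0) {
            throw new IllegalArgumentException(
                    "UNICODE_CHARACTER_CLASS is not allowed; use explicit \\p{Is...} classes");
        }
        return Pattern.compile(regex, BASE_FLAGS | extraFlags);
    }
}
```

This does not catch an inline (?U) written inside the regex string itself, so a code review rule or a simple source scan would still be needed for that case.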