8354490: Pattern.CANON_EQ causes a pattern to not match a string with a UNICODE variation
The root cause is an off-by-one bug introduced by a change we made to Pattern.CANON_EQ years ago. See https://cr.openjdk.org/~sherman/regexCE/Note.txt for background info.
As described in the writeup above, the basic logic of that change is this: instead of generating the permutations, creating the alternation, and then putting it appropriately into the character class (logically), we now use a special "Node", the NFCCharProperty, to do the matching work. The NFCCharProperty tries to match one grapheme cluster at a time against the character class (NFC greedily, then backtrack).
It appears we have an off-by-one bug in the backtrack boundary condition check when it backtracks to the position 'after' the base (main) character (in the case where the resulting 'nfc' string is not a single-character string, or does not match). In such cases, we still need to compare the base character against the predicate to find a potential match.
For example, in the reported scenario the target string contains the pair U+2764 (emoji) + U+FE0F (variation selector/emoji_component). The boundary edge j = Grapheme.nextBoundary() starts at 2 (after U+FE0F), then backtracks to 1. The current boundary check incorrectly exits here because 0 + 1 < 1 is false, so the base character is never compared against the predicate.
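The boundary arithmetic above can be traced with a small standalone model. This is only a sketch under stated assumptions, not the code in Pattern.java: `BacktrackSketch`, `matchOne`, and the `tryBase` flag are hypothetical names, the `clusterEnd` argument stands in for the value returned by Grapheme.nextBoundary(), and the predicate `cp == 0x2764` stands in for the real character class.

```java
import java.text.Normalizer;
import java.util.function.IntPredicate;

public class BacktrackSketch {

    // Hypothetical model of "NFC greedily, then backtrack".
    // clusterEnd stands in for Grapheme.nextBoundary(); tryBase toggles the
    // extra comparison of the base character once backtracking reaches it.
    // Returns the index just past the matched text, or -1 if nothing matches.
    static int matchOne(String s, int i, int clusterEnd, IntPredicate p, boolean tryBase) {
        int ch0 = s.codePointAt(i);
        int n = Character.charCount(ch0);
        int j = clusterEnd;
        if (i + n == j) {
            // The cluster is a single code point: test it directly.
            return p.test(ch0) ? j : -1;
        }
        while (i + n < j) {
            String nfc = Normalizer.normalize(s.substring(i, j), Normalizer.Form.NFC);
            if (nfc.codePointCount(0, nfc.length()) == 1 && p.test(nfc.codePointAt(0))) {
                return j;
            }
            // Backtrack by one code point.
            j -= Character.charCount(s.codePointBefore(j));
        }
        // Here j == i + n. Without the extra check the base character is skipped.
        if (tryBase && p.test(ch0)) {
            return i + n;
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "\u2764\ufe0f";              // U+2764 followed by U+FE0F
        IntPredicate heart = cp -> cp == 0x2764; // stand-in for the character class
        System.out.println(matchOne(s, 0, 2, heart, false)); // prints -1
        System.out.println(matchOne(s, 0, 2, heart, true));  // prints 1
    }
}
```

With `tryBase` set to false the loop exits at j = 1 because 0 + 1 < 1 is false, and the model reports no match; with it set to true the base character U+2764 is compared against the predicate and the match ends at index 1.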
This emoji pair should match correctly, as shown below:
jshell> var p = Pattern.compile("\\p{IsEmoji}\\p{IsEmoji_Component}", Pattern.CANON_EQ);
p ==> \p{IsEmoji}\p{IsEmoji_Component}
jshell> p.matcher("\u2764\ufe0f").matches();
$53 ==> true
or
jshell> var p = Pattern.compile("\\p{IsEmoji}", Pattern.CANON_EQ);
p ==> \p{IsEmoji}
jshell> p.matcher("\u2764\ufe0f").find();
$55 ==> true
This bug is not limited to emoji + variation selector pairs (which don't 'nfc' into a single character, even though they are treated as a single grapheme cluster). It also impacts cases involving dangling or unmatched combining character(s). For example, the following should match/find, even in Pattern.CANON_EQ mode.
jshell> p = Pattern.compile("\\p{IsGreek}\\p{IsAlphabetic}", Pattern.CANON_EQ);
p ==> \p{IsGreek}\p{IsAlphabetic}
jshell> p.matcher("\u1f80\u0345").matches();
$57 ==> true
jshell> p = Pattern.compile("[\\p{IsAlphabetic}]*", Pattern.CANON_EQ);
p ==> [\p{IsAlphabetic}]*
jshell> p.matcher("\u1f80\u0345").matches();
$59 ==> true
Note: the grapheme boundary is not necessarily the same as the resulting NFC boundary.
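To illustrate the note, the following sketch counts the code points left after NFC normalization for the two strings used in the examples, plus one contrasting string that does compose. It uses only java.text.Normalizer; the class and method names (`NfcLength`, `nfcCodePoints`) are hypothetical, and the "a" + U+0301 input is an added illustration, not part of the bug report.

```java
import java.text.Normalizer;

public class NfcLength {

    // Number of code points remaining after NFC normalization.
    static int nfcCodePoints(String s) {
        String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
        return nfc.codePointCount(0, nfc.length());
    }

    public static void main(String[] args) {
        // U+2764 + U+FE0F: the variation selector does not compose.
        System.out.println(nfcCodePoints("\u2764\ufe0f")); // prints 2
        // U+1F80 + U+0345: the trailing combining mark stays unmatched.
        System.out.println(nfcCodePoints("\u1f80\u0345")); // prints 2
        // "a" + U+0301 composes to the single code point U+00E1.
        System.out.println(nfcCodePoints("a\u0301"));      // prints 1
    }
}
```

The first two inputs keep two code points after NFC, so the matcher has to backtrack past the end of the cluster to reach a prefix that the character class can be tested against.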
Progress
- [ ] Change must be properly reviewed (1 review required, with at least 1 Reviewer)
- [x] Change must not contain extraneous whitespace
- [x] Commit message must refer to an issue
Issue
- JDK-8354490: Pattern.CANON_EQ causes a pattern to not match a string with a UNICODE variation (Bug - P3)
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25986/head:pull/25986
$ git checkout pull/25986
Update a local copy of the PR:
$ git checkout pull/25986
$ git pull https://git.openjdk.org/jdk.git pull/25986/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 25986
View PR using the GUI difftool:
$ git pr show -t 25986
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25986.diff
Using Webrev
:wave: Welcome back sherman! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.
@xuemingshen-oracle This change now passes all automated pre-integration checks.
ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.
After integration, the commit message for the final commit will be:
8354490: Pattern.CANON_EQ causes a pattern to not match a string with a UNICODE variation
Reviewed-by: rriggs, naoto
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.
At the time when this comment was updated there had been 54 new commits pushed to the master branch:
- 9d518b3213af7c60cb604138a2c4022181bb2daa: 8310831: Some methods are missing from CDS regenerated JLI holder class
- 1dda79cfab597782e0a7bb63af6dcc30aeff62d1: 8360743: Enables regeneration of JLI holder classes for CDS static dump
- aa1911191cf8c2b855268a76baf0757909d66d1b: 8360867: CTW: Disable inline cache verification
- ... and 51 more: https://git.openjdk.org/jdk/compare/ba0c12231b0f5b680951e75765b5d292f31a2cbc...master
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.
➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
@xuemingshen-oracle The following label will be automatically applied to this pull request:
core-libs
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
Thanks for the reviews! /integrate
Going to push as commit 61a590e9bea64ddfd465a5e6f224bc2979d841e9.
Since your change was applied there have been 54 commits pushed to the master branch:
- 9d518b3213af7c60cb604138a2c4022181bb2daa: 8310831: Some methods are missing from CDS regenerated JLI holder class
- 1dda79cfab597782e0a7bb63af6dcc30aeff62d1: 8360743: Enables regeneration of JLI holder classes for CDS static dump
- aa1911191cf8c2b855268a76baf0757909d66d1b: 8360867: CTW: Disable inline cache verification
- ... and 51 more: https://git.openjdk.org/jdk/compare/ba0c12231b0f5b680951e75765b5d292f31a2cbc...master
Your commit was automatically rebased without conflicts.
@xuemingshen-oracle Pushed as commit 61a590e9bea64ddfd465a5e6f224bc2979d841e9.
:bulb: You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.