dk.brics.automaton
dk.brics.automaton copied to clipboard
Regex doesn't handle surrogate pairs properly
Hi, thank you for providing a great regular expression library!
I have noticed that brics handles input regex string as a sequence of java.lang.Character
, and this could cause a somewhat unintuitive behavior.
For example, 𠀋<𠮟<𡵅
as a Unicode Scalar Value (0x2000b
, 0x20b9f
, 0x21d45
respectively, all of them will be expressed with surrogate pairs), but automaton created from [𠀋-𡵅]
doesn't accept 𠮟
.
private static void checkOneCodePoint(final String s) {
if (s.codePointCount(0, s.length()) != 1) throw new IllegalArgumentException();
}
public static boolean testBrics(final String a, final String b, final String c) {
checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
final RegExp regex = new RegExp("[" + a + "-" + c + "]");
return regex.toAutomaton().run(b);
}
public static boolean testJava(final String a, final String b, final String c) {
checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
final Pattern pattern = Pattern.compile("[" + a + "-" + c + "]");
return pattern.matcher(b).matches();
}
public static void main(final String[] args) throws IOException {
final String a = new String(new int[]{0x2000b}, 0, 1); // 𠀋
final String b = new String(new int[]{0x20b9f}, 0, 1); // 𠮟
final String c = new String(new int[]{0x21d45}, 0, 1); // 𡵅
System.out.println(testBrics(a, b, c)); // false
System.out.println(testJava(a, b, c)); // true
}
Fixing this would require us to do
- Read regex string as (not
java.lang.Character
-by-java.lang.Character
but) Code Point stream. This also includes fixes for operator precedence, like𠀋+
. - Convert them to
java.lang.Character
s, and if they involve surrogate pairs, do something similar to what we do for numerical interval<n-m>
Although won't-fix totally make sense, it'd be great if we could find this fact in the documentation.
Thanks,
I have found a similar issue:
When I try to generate strings for a regex containing emoji using the following code :
RegExp r = new RegExp("[😁-😘]{1}");
Automaton a = r.toAutomaton();
System.out.println("automaton=" + a.toString());
the toString() produces the following:
automaton=initial state: 0
state 0 [reject]:
\ud83d -> 1
\ude18 -> 1
state 1 [accept]:
The emoji's used are:
😁
U+1F601
\0xD83D \0xDE01
55357 56833
1101 1000 0011 1101 1101 1110 0000 0001
to
😘
U+1F618
\0xD83D \0xDE18
55357 56856
1101 1000 0011 1101 1101 1110 0001 1000
You cannot fix range handling of Unicode chars that are outside the 16bit values possible with a single Java char, as per above.
However, you can coax Brics to handle repetition and other tokens using grouping (😘)
and could potentially replace the range with explicit alternatives.
Just leaving this up here in case its of use to someone.
// GIVEN; input that contains multiple supplementary characters
final String input = "😘😘😘abc";
// WHEN; using standard broken for for repetition emojis
final RegExp broke = new RegExp("😘+abc");
final Automaton aBroke = broke.toAutomaton();
// THEN; does not match incorrectly
System.out.println("Broke: " + aBroke.run(input));
// WHEN; coaxing brics to treat two separate chars as single entity
final RegExp fix = new RegExp("(😘)+abc");
final Automaton aFix = fix.toAutomaton();
// THEN; does match correctly
System.out.println("Fixed: " + aFix.run(input));
Will result in:
Broke: false
Fixed: true
And here is another example
// GIVEN; input that contains multiple supplementary characters
final String input = "😘😤😘😤😘abc";
// WHEN; using standard broken for for repetition emojis
final RegExp broke = new RegExp("[😤😘]+abc");
final Automaton aBroke = broke.toAutomaton();
// THEN; does not match incorrectly
System.out.println("Broke: " + aBroke.run(input));
// WHEN; coaxing brics to treat two separate chars as single entity
final RegExp fix = new RegExp("((😘)|(😤))+abc");
final Automaton aFix = fix.toAutomaton();
// THEN; does match correctly
System.out.println("Fixed: " + aFix.run(input));
In the last example, the broken one actually returns true but I believe its not matching correctly simply treating each of the 4 chars that make up the 2 code points as allowed and therefore matching on the string. However, it could potentially match other code points incorrectly.
FIY you can also match code point ranges. I had made a corresponding pull request a while ago (PR #35). The relevant function is makeCodePointRange( int min, int max ), which returns an Automaton matching the given (valid) code point range.