dk.brics.automaton icon indicating copy to clipboard operation
dk.brics.automaton copied to clipboard

Regex doesn't handle surrogate pairs properly

Open MiSawa opened this issue 6 years ago • 3 comments

Hi, thank you for providing a great regular expression library!

I have noticed that brics handles input regex string as a sequence of java.lang.Character, and this could cause a somewhat unintuitive behavior.

For example, 𠀋<𠮟<𡵅 as a Unicode Scalar Value (0x2000b, 0x20b9f, 0x21d45 respectively, all of them will be expressed with surrogate pairs), but automaton created from [𠀋-𡵅] doesn't accept 𠮟.

private static void checkOneCodePoint(final String s) {
    if (s.codePointCount(0, s.length()) != 1) throw new IllegalArgumentException();
}

public static boolean testBrics(final String a, final String b, final String c) {
    checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
    final RegExp regex = new RegExp("[" + a + "-" + c + "]");
    return regex.toAutomaton().run(b);
}

public static boolean testJava(final String a, final String b, final String c) {
    checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
    final Pattern pattern = Pattern.compile("[" + a + "-" + c + "]");
    return pattern.matcher(b).matches();
}

public static void main(final String[] args) throws IOException {
    final String a = new String(new int[]{0x2000b}, 0, 1); // 𠀋
    final String b = new String(new int[]{0x20b9f}, 0, 1); // 𠮟
    final String c = new String(new int[]{0x21d45}, 0, 1); // 𡵅
    System.out.println(testBrics(a, b, c)); // false
    System.out.println(testJava(a, b, c)); // true
}

Fixing this would require us to do

  • Read regex string as (not java.lang.Character-by-java.lang.Character but) Code Point stream. This also includes fixes for operator precedence, like 𠀋+.
  • Convert them to java.lang.Characters, and if they involve surrogate pairs, do something similar to what we do for numerical interval <n-m>

Although won't-fix totally make sense, it'd be great if we could find this fact in the documentation.

Thanks,

MiSawa avatar Jan 31 '19 07:01 MiSawa

I have found a similar issue:

When I try to generate strings for a regex containing emoji using the following code :

RegExp r = new RegExp("[😁-😘]{1}");
Automaton a = r.toAutomaton();
System.out.println("automaton=" + a.toString());

the toString() produces the following:

automaton=initial state: 0
state 0 [reject]:
      \ud83d -> 1
      \ude18 -> 1
    state 1 [accept]:

The emoji's used are:

😁

U+1F601
\0xD83D               \0xDE01
55357                 56833
1101 1000 0011 1101   1101 1110 0000 0001

to

😘

U+1F618
\0xD83D               \0xDE18
55357                 56856
1101 1000 0011 1101   1101 1110 0001 1000

mrspaceman avatar Feb 04 '19 10:02 mrspaceman

You cannot fix range handling of Unicode chars that are outside the 16bit values possible with a single Java char, as per above.

However, you can coax Brics to handle repetition and other tokens using grouping (😘) and could potentially replace the range with explicit alternatives.

Just leaving this up here in case its of use to someone.

    // GIVEN; input that contains multiple supplementary characters
    final String input = "😘😘😘abc";
    // WHEN; using standard broken for for repetition emojis
    final RegExp broke = new RegExp("😘+abc");
    final Automaton aBroke = broke.toAutomaton();

    // THEN; does not match incorrectly
    System.out.println("Broke: " + aBroke.run(input));

    // WHEN; coaxing brics to treat two separate chars as single entity
    final RegExp fix = new RegExp("(😘)+abc");
    final Automaton aFix = fix.toAutomaton();

    // THEN; does match correctly
    System.out.println("Fixed: " + aFix.run(input));

Will result in:

Broke: false
Fixed: true

And here is another example

    // GIVEN; input that contains multiple supplementary characters
    final String input = "😘😤😘😤😘abc";
    // WHEN; using standard broken for for repetition emojis
    final RegExp broke = new RegExp("[😤😘]+abc");
    final Automaton aBroke = broke.toAutomaton();

    // THEN; does not match incorrectly
    System.out.println("Broke: " + aBroke.run(input));

    // WHEN; coaxing brics to treat two separate chars as single entity
    final RegExp fix = new RegExp("((😘)|(😤))+abc");
    final Automaton aFix = fix.toAutomaton();

    // THEN; does match correctly
    System.out.println("Fixed: " + aFix.run(input));

In the last example, the broken one actually returns true but I believe its not matching correctly simply treating each of the 4 chars that make up the 2 code points as allowed and therefore matching on the string. However, it could potentially match other code points incorrectly.

turf00 avatar May 17 '23 08:05 turf00

FIY you can also match code point ranges. I had made a corresponding pull request a while ago (PR #35). The relevant function is makeCodePointRange( int min, int max ), which returns an Automaton matching the given (valid) code point range.

DirkToewe avatar May 17 '23 09:05 DirkToewe