joni Valid UTF-8 input can cause infinite loop in JONI

In #7, @electrum identified a location that can cause an infinite loop in JONI. It was marked as won't fix because input can be sanitized beforehand and JONI assumes that the input is always valid.

When the pattern is "\uD800", it can be pre-sanitized, as you suggested in #7. What if the pattern is "\\uD800"? How can the user sanitize it?
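To make the difference concrete, here is a minimal sketch using only the JDK (no JONI classes). The class name `EscapedPatternSketch` and the helper `isValidUtf8` are made up for illustration. It shows that the escaped form is six plain ASCII characters, so a byte-level UTF-8 validity check has nothing to reject, while the encoded surrogate itself is rejected.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EscapedPatternSketch {
    // Hypothetical sanitizer: true if the bytes decode as strict UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The Java literal "\\uD800" is the six characters \ u D 8 0 0.
        byte[] escaped = "\\uD800".getBytes(StandardCharsets.UTF_8);
        System.out.println(escaped.length);        // 6
        System.out.println(isValidUtf8(escaped));  // true: nothing to sanitize at the byte level

        // The UTF-8-style encoding of the lone surrogate U+D800 is malformed.
        byte[] rawSurrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80};
        System.out.println(isValidUtf8(rawSurrogate)); // false
    }
}
```

A sanitizer would therefore have to understand the regex escape syntax itself, not just the byte encoding of the pattern.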

If JONI is willing to add a check, the fix would be the same as for #7: check whether the return value of enc.length is negative in OptExactInfo.concatStr.
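The kind of check meant here can be sketched as follows. This is not JONI's code: `GuardedCopySketch`, `charLength`, and `copyChars` are hypothetical names, and `charLength` is a simplified stand-in for an encoding's length function that returns -1 for malformed input. The assumption is that a copy loop which advances by the reported length makes no progress when that length is negative, so the guard turns a stall into an error.

```java
public class GuardedCopySketch {
    // Toy stand-in for an encoding's length(): byte count of the character
    // at p, or -1 if the sequence is malformed. Simplified; it does not
    // cover every UTF-8 rule.
    static int charLength(byte[] b, int p, int end) {
        int lead = b[p] & 0xFF;
        int len;
        if (lead < 0x80) len = 1;
        else if (lead >= 0xC2 && lead <= 0xDF) len = 2;
        else if (lead >= 0xE0 && lead <= 0xEF) len = 3;
        else if (lead >= 0xF0 && lead <= 0xF4) len = 4;
        else return -1;
        if (p + len > end) return -1;
        for (int i = 1; i < len; i++) {
            if ((b[p + i] & 0xC0) != 0x80) return -1; // not a continuation byte
        }
        // ED A0..BF xx would encode a surrogate (U+D800..U+DFFF): reject it.
        if (lead == 0xED && (b[p + 1] & 0xFF) >= 0xA0) return -1;
        return len;
    }

    // Copies whole characters from src[p..end) into dst and returns the
    // number of bytes copied. Without the guard, a negative length would
    // leave p unchanged and the loop would never terminate.
    static int copyChars(byte[] src, int p, int end, byte[] dst) {
        int i = 0;
        while (p < end) {
            int len = charLength(src, p, end);
            if (len <= 0) {
                throw new IllegalArgumentException("malformed sequence at byte " + p);
            }
            if (i + len > dst.length) break; // destination is full
            for (int j = 0; j < len; j++) dst[i++] = src[p++];
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] dst = new byte[24];
        System.out.println(copyChars(new byte[] {'A', 'B'}, 0, 2, dst));
        try {
            copyChars(new byte[] {'A', (byte) 0xED, (byte) 0xA0, (byte) 0x80}, 0, 4, dst);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```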

Mar 18 '15 22:03 haozhun

In addition, \uD800\uDC00, which is a legal sequence, will also result in an infinite loop, because JONI considers every \uXXXX as a code point.
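For comparison, a parser could pair adjacent surrogate escapes into one code point before encoding. The sketch below is only an illustration of that idea, not a description of how JONI parses escapes. `SurrogateEscapeSketch`, `readEscape`, and `decodeCodePoint` are hypothetical names, and only the `Character` methods come from the JDK.

```java
public class SurrogateEscapeSketch {
    // Reads the four hex digits of a \\uXXXX escape whose backslash is at
    // index i. Returns the value, or -1 if there is no such escape there.
    static int readEscape(String s, int i) {
        if (i < 0 || i + 6 > s.length()) return -1;
        if (s.charAt(i) != '\\' || s.charAt(i + 1) != 'u') return -1;
        int value = 0;
        for (int k = i + 2; k < i + 6; k++) {
            int digit = Character.digit(s.charAt(k), 16);
            if (digit < 0) return -1;
            value = value * 16 + digit;
        }
        return value;
    }

    // Decodes the escape at index i into a code point. A high surrogate
    // followed by a low surrogate escape becomes one supplementary code
    // point; an unpaired surrogate is rejected.
    static int decodeCodePoint(String s, int i) {
        int first = readEscape(s, i);
        if (first < 0) {
            throw new IllegalArgumentException("no \\uXXXX escape at index " + i);
        }
        if (Character.isHighSurrogate((char) first)) {
            int second = readEscape(s, i + 6);
            if (second >= 0 && Character.isLowSurrogate((char) second)) {
                return Character.toCodePoint((char) first, (char) second);
            }
            throw new IllegalArgumentException("unpaired high surrogate at index " + i);
        }
        if (Character.isLowSurrogate((char) first)) {
            throw new IllegalArgumentException("unpaired low surrogate at index " + i);
        }
        return first;
    }

    public static void main(String[] args) {
        // \uD800\uDC00 as escape text pairs up to U+10000.
        System.out.println(Integer.toHexString(decodeCodePoint("\\uD800\\uDC00", 0))); // 10000
        try {
            decodeCodePoint("A\\uD800", 1);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```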

Mar 26 '15 00:03 haozhun

@haozhun - can you show some JRuby or Java code that illustrates the endless loop?

Apr 26 '16 14:04 guyboertje

Note that in the past year we did add the ability to interrupt joni when it's stuck looping on bad input (or just large input/slow regex).

@haozhun Can you propose a patch? @lopex would probably be the best one to review such a change.

May 02 '16 17:05 headius

Java code that illustrates the infinite loop. It can be mitigated by using NonStrict... instead, as shown in the commented-out line.

    import java.nio.charset.StandardCharsets;
    import org.jcodings.specific.UTF8Encoding;
    import org.joni.Matcher;
    import org.joni.Option;
    import org.joni.Regex;
    import org.joni.Syntax;

    public class InfiniteLoopRepro {
        public static void main(String[] args)
        {
            byte[] pattern = "A\\uD800".getBytes(StandardCharsets.UTF_8); // escape text, itself valid UTF-8
            byte[] str = ("AB").getBytes(StandardCharsets.UTF_8);
            Regex regex = new Regex(pattern, 0, pattern.length, Option.NEGATE_SINGLELINE, UTF8Encoding.INSTANCE, Syntax.Java);
            // Regex regex = new Regex(pattern, 0, pattern.length, Option.NEGATE_SINGLELINE, NonStrictUTF8Encoding.INSTANCE, Syntax.Java);
            Matcher matcher = regex.matcher(str);
            int result = matcher.search(0, str.length, Option.DEFAULT);
            System.out.println(result);
        }
    }

Patch: https://github.com/jruby/joni/pull/21

May 03 '16 20:05 haozhun

Ahh I see, this does not apply to JRuby (checked 1.7.24) because there is a range check.

It raises RegexpError: invalid Unicode range: /A\uD800/

May 04 '16 09:05 guyboertje
