Valid UTF-8 input can cause infinite loop in JONI
In #7, @electrum identified a location that can cause an infinite loop in JONI. It was marked as won't fix because the input can be sanitized beforehand and JONI assumes that its input is always valid.
When the pattern is "\uD800", it can be pre-sanitized, as you suggested in #7. But what if the pattern is "\\uD800"? How can the user sanitize that?
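To make the distinction concrete, here is a minimal sketch (not JONI code) showing why a byte-level UTF-8 check cannot catch the escaped form. The class name `EscapedSurrogateDemo` and the helper `isValidUtf8` are made up for illustration; the check uses only the JDK's strict UTF-8 decoder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EscapedSurrogateDemo {
    // Hypothetical sanitizer check: true when the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Escaped form: six ASCII characters (backslash, u, D, 8, 0, 0).
        byte[] escaped = "\\uD800".getBytes(StandardCharsets.UTF_8);
        // Raw form: the three-byte encoding of the lone surrogate itself.
        byte[] raw = {(byte) 0xED, (byte) 0xA0, (byte) 0x80};

        System.out.println("escaped form valid: " + isValidUtf8(escaped));
        System.out.println("raw form valid:     " + isValidUtf8(raw));
    }
}
```

The raw form is rejected by the byte-level check, while the escaped form passes it, so the surrogate only appears once the regex parser interprets the escape.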
If JONI is willing to add a check, it would be the same fix as for #7: checking whether the return value of enc.length is negative in OptExactInfo.concatStr.
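The kind of guard I have in mind looks roughly like the sketch below. This is not JONI's actual code: `charLength` is a simplified, hypothetical stand-in for an encoding's length function (it does not validate continuation bytes), and `copyChars` only imitates a loop that copies one character at a time. The point is that a negative length must be rejected, because otherwise the read position never advances.

```java
public class LengthGuardDemo {
    // Hypothetical stand-in for an encoding's length function: returns the
    // byte length of the character starting at p, or -1 if the sequence is
    // not acceptable. Simplified; continuation bytes are not validated.
    static int charLength(byte[] bytes, int p, int end) {
        int b = bytes[p] & 0xFF;
        int len;
        if (b < 0x80) len = 1;
        else if ((b & 0xE0) == 0xC0) len = 2;
        else if ((b & 0xF0) == 0xE0) len = 3;
        else if ((b & 0xF8) == 0xF0) len = 4;
        else return -1;                      // stray continuation or bad lead byte
        if (p + len > end) return -1;        // truncated sequence
        // Three-byte sequences starting ED A0..BF encode surrogates: reject.
        if (b == 0xED && (bytes[p + 1] & 0xFF) >= 0xA0) return -1;
        return len;
    }

    // Copies whole characters from src into out and returns the number of
    // bytes copied. Without the negative-length check, the inner loop would
    // copy nothing, p would never advance, and the outer loop would spin.
    static int copyChars(byte[] src, byte[] out) {
        int p = 0;
        int i = 0;
        while (p < src.length && i < out.length) {
            int len = charLength(src, p, src.length);
            if (len < 0) {
                throw new IllegalArgumentException("invalid byte sequence at offset " + p);
            }
            if (i + len > out.length) break; // next character does not fit
            for (int j = 0; j < len; j++) out[i++] = src[p++];
        }
        return i;
    }

    public static void main(String[] args) {
        System.out.println(copyChars(new byte[]{'A', 'B'}, new byte[8]));
    }
}
```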
In addition, \uD800\uDC00, which is a legal sequence, also results in an infinite loop, because JONI considers every \uXXXX escape to be a code point of its own.
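For comparison, here is a rough sketch of how a parser could treat two adjacent escapes as one code point. It is not taken from JONI; `SurrogateEscapeDemo` and `decodeEscapes` are hypothetical names, and the input is assumed to consist only of backslash-u escapes with four hex digits each. The surrogate handling uses the JDK's `Character` helpers.

```java
import java.util.ArrayList;
import java.util.List;

public class SurrogateEscapeDemo {
    // Reads the four hex digits of the escape that starts at index i.
    static int readUnit(String s, int i) {
        if (i + 6 > s.length() || !s.startsWith("\\u", i)) {
            throw new IllegalArgumentException("expected an escape at index " + i);
        }
        return Integer.parseInt(s.substring(i + 2, i + 6), 16);
    }

    // Decodes a run of escapes into code points, merging a high surrogate
    // with the low surrogate that follows it and rejecting lone surrogates.
    static List<Integer> decodeEscapes(String pattern) {
        List<Integer> codePoints = new ArrayList<>();
        int i = 0;
        while (i < pattern.length()) {
            char unit = (char) readUnit(pattern, i);
            i += 6;
            if (Character.isHighSurrogate(unit) && pattern.startsWith("\\u", i)) {
                char next = (char) readUnit(pattern, i);
                if (Character.isLowSurrogate(next)) {
                    codePoints.add(Character.toCodePoint(unit, next));
                    i += 6;
                    continue;
                }
            }
            if (Character.isSurrogate(unit)) {
                throw new IllegalArgumentException("lone surrogate escape");
            }
            codePoints.add((int) unit);
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // The pair D800 DC00 is the UTF-16 form of U+10000.
        System.out.println(Integer.toHexString(decodeEscapes("\\uD800\\uDC00").get(0)));
    }
}
```

Rejecting the lone surrogate with an error is one possible choice here; it resembles the range check that makes other front ends refuse such a pattern up front.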
@haozhun - can you show some jruby or java code that illustrates the endless loop?
Note that in the past year we did add the ability to interrupt joni when it's stuck looping on bad input (or just large input/slow regex).
@haozhun Can you propose a patch? @lopex would probably be the best one to review such a change.
Here is Java code that illustrates the infinite loop. It can be mitigated by using NonStrict... instead, as shown in the commented-out line.
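import java.nio.charset.StandardCharsets;
import org.jcodings.specific.UTF8Encoding;
import org.joni.Matcher;
import org.joni.Option;
import org.joni.Regex;
import org.joni.Syntax;

public class InfiniteLoopRepro {
    public static void main(String[] args) {
        // The pattern bytes are plain ASCII (an escaped lone surrogate), so they are valid UTF-8.
        byte[] pattern = "A\\uD800".getBytes(StandardCharsets.UTF_8);
        byte[] str = "AB".getBytes(StandardCharsets.UTF_8);
        Regex regex = new Regex(pattern, 0, pattern.length, Option.NEGATE_SINGLELINE, UTF8Encoding.INSTANCE, Syntax.Java);
        // Regex regex = new Regex(pattern, 0, pattern.length, Option.NEGATE_SINGLELINE, NonStrictUTF8Encoding.INSTANCE, Syntax.Java);
        Matcher matcher = regex.matcher(str);
        int result = matcher.search(0, str.length, Option.DEFAULT);
        System.out.println(result);
    }
}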
Patch: https://github.com/jruby/joni/pull/21
Ahh I see, this does not apply to JRuby (checked 1.7.24) because there is a range check.
raises RegexpError: invalid Unicode range: /A\uD800/