Develop a Strategy for Efficiently Handling Non-ASCII Regexes
The typical approach for determining the next state while looping through characters is to look it up in an array, after determining the byteClass for that character. We determine the byteClass by looking up the character in a byte[129] array (one entry per ASCII character, plus a single shared entry for everything else). Schematically:
while (state != -1 && index < length) {
    char character = string.charAt(index++);
    // every character >= 128 shares the single non-ASCII class at index 128
    int byteClass = byteClasses[Math.min(128, character)];
    state = states[state][byteClass];
}
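For concreteness, here is a minimal self-contained version of that scheme for the regex `ab`, matching against the whole input. The tables are hand-written for illustration; the real implementation generates them during compilation:

```java
static boolean matchesAb(String string) {
    byte[] byteClasses = new byte[129]; // defaults to 0: class 0 means "any other character"
    byteClasses['a'] = 1;
    byteClasses['b'] = 2;
    int[][] states = {
            // class:  other, 'a', 'b'
            {-1, 1, -1},  // state 0: start, expects 'a'
            {-1, -1, 2},  // state 1: saw 'a', expects 'b'
            {-1, -1, -1}, // state 2: accepting, any further input fails
    };
    int state = 0;
    for (int index = 0; index < string.length() && state != -1; index++) {
        state = states[state][byteClasses[Math.min(128, string.charAt(index))]];
    }
    return state == 2;
}
```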
If a regex contains non-ASCII characters, supporting them this way would force the byteClass array to be significantly larger, and it's less obvious that the strategy is a good one. As of today, we never use byteClasses to look up non-ASCII characters.
There are a few approaches we can take here:
- Some regexes may use non-ASCII characters, but only from the early parts of the BMP (a regex matching Cyrillic characters, for instance). In this case, we could simply use a byteClass array larger than the current 129 elements, without too much overhead (see the first sketch after this list). This would require some measurement to determine at what point the increased memory use becomes unacceptable.
- If we can efficiently remap characters into a smaller space, we can keep our byteClass array small. If a regex contains contiguous characters from a single alphabet, then we could use subtraction to map them into the [0, n] interval (see the second sketch below). How well this works in any given case depends on how complex the mapping would be.
- RE2 and Rust's regex library have state machines that operate a byte at a time (see the third sketch below). How this would work with Java's internal use of UTF-16, I'm not yet sure.
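As a sketch of the first option: if compilation recorded the largest code point a regex can match (maxCodePoint below is that hypothetical value), the class table could be sized to that bound plus one catch-all slot. For a purely Cyrillic regex, whose largest code point is at most 0x04FF, that means 1,281 bytes instead of 129:

```java
// Sketch, assuming maxCodePoint is computed during regex compilation.
static byte[] buildByteClasses(int maxCodePoint) {
    // one slot per code point up to maxCodePoint, plus a shared slot for everything above
    return new byte[maxCodePoint + 2];
}

static int classOf(byte[] byteClasses, char c) {
    // clamp exactly as the 129-entry version does, just with a larger bound
    return byteClasses[Math.min(byteClasses.length - 1, c)];
}
```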
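The second option might look like the following, assuming (purely for illustration) a regex whose only non-ASCII characters are Cyrillic (U+0400-U+04FF). The remapped alphabet fits in [0, 384], so the class table needs 385 entries rather than 0x0500:

```java
// Sketch: collapse a contiguous non-ASCII alphabet onto a dense range just above ASCII.
// A real implementation would derive the ranges and offsets during compilation.
static int remap(char c) {
    if (c < 128) {
        return c;                    // ASCII passes through unchanged
    } else if (c >= 0x0400 && c <= 0x04FF) {
        return 128 + (c - 0x0400);   // Cyrillic mapped onto [128, 384)
    }
    return 384;                      // catch-all for everything else
}
```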
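For the third option, one way it might work (an assumption on my part, not necessarily how RE2 or Rust's regex integrates with a host language) is to transcode the input to UTF-8 and drive the DFA with raw bytes, so the class table is always exactly 256 entries regardless of which code points the regex mentions. On the JVM that transcoding costs an extra copy of the input:

```java
import java.nio.charset.StandardCharsets;

// Sketch of a byte-at-a-time DFA loop over a UTF-8 encoding of the input.
static int runByteDfa(String input, byte[] byteClasses, int[][] states) {
    byte[] utf8 = input.getBytes(StandardCharsets.UTF_8); // extra copy of the input
    int state = 0;
    for (int i = 0; i < utf8.length && state != -1; i++) {
        int byteClass = byteClasses[utf8[i] & 0xFF]; // mask: bytes are signed in Java
        state = states[state][byteClass];
    }
    return state;
}
```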
Added Unicode benchmarks in https://github.com/hyperpape/needle-benchmarks/blob/main/src/main/java/com/justinblank/strings/SherlockTranslatedUnicodeBenchmark.java. The results show that performance on Unicode needles is quite bad: comparable to the standard library's java.util.regex, and even worse in a few cases.