needle icon indicating copy to clipboard operation
needle copied to clipboard

Develop a Strategy for Efficiently Handling Non-Ascii Regexes

Open hyperpape opened this issue 1 year ago • 1 comments

The typical approach for determining the next state while looping through characters is to look it up from an array, after determining the byteClass for that character. We determine the byteClass by looking up the character in a byte[129] array. Schematically:

while (state != -1 && index < length) {
    character = string.chatAt(index++);
    byteClass = byteClasses[Math.min(128, character)];
    nextState = states[stateNumber][byteClass];
}

If a regex contains non-ascii characters, then we would force the byte array to be significantly larger, and it's less obvious that this is a good strategy. As of today, we do not ever use byteClasses to look up non-ascii characters.

There are a few approaches we can take here:

  1. Some regexes may use non-ascii characters, but only from the early parts of the BMP (a regex matching Cyrillic characters, for instance). In this case, we could simply use a byteClass array larger than 128 elements, without too much overhead. This would require a degree of measurement to determine at what point the increased memory use becomes unacceptable.

  2. If we can efficiently remap characters into a smaller space, we can keep our byteClass array small. If a regex contains contiguous characters from a single alphabet, then we could just use subtraction to map it into the [0,n] interval. How well this works in any given case depends on how complex the mapping would be.

  3. Re2 and Rust's regex library have state machines that operate a byte at a time. How this would work with Java's internal use of UTF-16, I'm not yet sure.

hyperpape avatar May 05 '24 15:05 hyperpape

Added unicode benchmarks in https://github.com/hyperpape/needle-benchmarks/blob/main/src/main/java/com/justinblank/strings/SherlockTranslatedUnicodeBenchmark.java, and the results are that the performance on unicode needles is quite bad, comparable to the standard library java regex or even worse in a few cases.

hyperpape avatar Sep 02 '25 09:09 hyperpape