re is much slower than on CPython
I've been adding GraalPy support to a classifier-type project that is naively based on applying a bunch of regexes to an input. While GraalPy works, applying the regexes is quite slow: about 4x slower than on CPython, while using 4 times the CPU.
Here's a repro script and accompanying data (basically a cut-down version of the naive classifier implementation): script.zip
timings:
> python3.12 --version
Python 3.12.6
> time python3.12 run.py
75158 lines in 11.7s
156.1 us/line
python3.12 run.py 11.77s user 0.05s system 97% cpu 12.172 total
> graalpy --version
GraalPy 3.11.7 (Oracle GraalVM Native 24.1.0)
> time graalpy run.py
75158 lines in 48.2s
640.7 us/line
graalpy run.py 192.00s user 1.00s system 394% cpu 48.976 total
This is on a 10-core M1 Pro. Using cpusampler, I confirmed that essentially all the "user" time is in _sre:
_search || 43580ms 98.6% || 43580ms 98.6% || <frozen graalpy._sre>~1:0
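(cpusampler being GraalVM's built-in sampling profiler, enabled here with something along the lines of graalpy --cpusampler run.py.)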
Regexes are their own compilation unit; their DFAs are JIT-compiled and optimized separately, just like Python functions. You have 628 regexes, which take a long time to compile. I got a better time (still worse than CPython) with JIT compilation for regexes turned off (--experimental-options --engine.CompileOnly='~tregex re'), so it seems that the JIT compilation is too slow for the regexes to pay off. I'll ask our regex experts if we can do something about it.
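Spelled out against the repro script, that invocation would be:
> graalpy --experimental-options --engine.CompileOnly='~tregex re' run.py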
So I talked to one of the devs of our regex engine. The main problem is that your regexes use a lot of x{n,m} patterns with large m. Those currently don't allow generating a reasonable DFA, so the regex engine has to use a slower fallback execution model. Someone is currently working on supporting these patterns in the faster execution model, so this should get better in the next release.
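To give an intuition for why a large m is a problem for an automaton-based engine, here is a toy desugaring for illustration (not how the engine is actually implemented internally):

def desugar_bounded(atom: str, n: int, m: int) -> str:
    # Expand atom{n,m} into n mandatory copies followed by (m - n)
    # nested optional copies, the classic bounded-repetition expansion.
    optional = ""
    for _ in range(m - n):
        optional = f"(?:{atom}{optional})?"
    return atom * n + optional

print(desugar_bounded("a", 2, 4))  # prints aa(?:a(?:a)?)?
# The expanded pattern, and the automaton built from it, grows linearly
# with m, so a repetition like a{1,1000} alone contributes on the order
# of a thousand states, and several such repetitions compound the problem.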
Ah I see, I didn't know Graal used a DFA internally, that explains things. I've had the same issue in Rust, as regex is also automata-based, and the way the regexes are written in the core data file also caused trouble there (mostly memory; runtime did suffer a bit, but not to the same extent).
I improved things by transforming the bounded repetitions back into unbounded ones before compilation; I'll see if I can do the same for Graal.
Although, from my understanding, the source dataset used bounded repetitions to limit the risk of catastrophic backtracking in backtracking regex engines (like CPython's own). Is there a flag exposed somewhere that indicates whether a regex uses backtracking or finite automata, so I can make sure I only perform the rewriting when a DFA is in use?
Currently, no. Our regex engine seems to have a property for that, but we currently don't expose it on the Python Pattern object. (_sre.tregex_compile(pattern, _sre._METHOD_SEARCH, False).isBacktracking works, but it's a total hack that might break any time)
OK, I'll go with an implementation check then, at least for the time being (assuming the rewriting plan pans out).
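Roughly along these lines (a sketch; it assumes sys.implementation.name reports "graalpy" and leans on the unstable tregex_compile hack above, so any part of it may break):

import sys

def uses_backtracking(pattern: str) -> bool | None:
    # Best-effort probe: True/False on GraalPy, None when unknown.
    if sys.implementation.name != "graalpy":
        return None  # CPython's re always backtracks, nothing to detect there
    try:
        import _sre  # GraalPy internals, explicitly not a stable API
        tregex = _sre.tregex_compile(pattern, _sre._METHOD_SEARCH, False)
        return bool(tregex.isBacktracking)
    except Exception:
        return None

def maybe_simplify(pattern: str) -> str:
    # Only rewrite when the engine is positively known to be automata-based.
    if uses_backtracking(pattern) is False:
        return fa_simplifier(pattern)  # the rewriting routine, shown below
    return pattern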
@msimacek With a "simplifier" added to the script, the timings do improve by about 25% on this machine, but that's still quite far behind CPython.
> graalpy run.py -c
75158 lines in 35.8s
476.8 us/line
graalpy run.py -c 142.91s user 0.93s system 391% cpu 36.729 total
The very minor improvement in runtime you noticed with regex JIT compilation turned off does seem to hold here (I get about 1 second less with the options you provided), though the main benefit is CPU consumption dropping from ~400% to ~100%. Somewhat oddly, it doesn't seem to change memory consumption much, if at all, even though that was by far the biggest effect on both rust-regex and RE2.
Here's the simplification routine I implemented:
import re

REPETITION_PATTERN = re.compile(r"\{(0|1)\s*,\s*\d{3,}\}")
CLASS_PATTERN = re.compile(
    r"""
    \[[^]]*\\(d|w)[^]]*\]
    |
    \\(d|w)
    """,
    re.VERBOSE,
)

def class_replacer(m: re.Match[str]) -> str:
    # group 1 set = escape found inside a class, so splice the range in bare;
    # otherwise it's a standalone escape and gets wrapped in its own class
    d, w = ("0-9", "A-Za-z0-9_") if m[1] else ("[0-9]", "[A-Za-z0-9_]")
    return m[0].replace(r"\d", d).replace(r"\w", w)

def fa_simplifier(pattern: str) -> str:
    # {0,NNN} -> * and {1,NNN} -> + whenever NNN has 3 or more digits
    pattern = REPETITION_PATTERN.sub(lambda m: "*" if m[1] == "0" else "+", pattern)
    return CLASS_PATTERN.sub(class_replacer, pattern)
It basically converts bounded repetitions whose upper bound has at least 3 digits (i.e. 100 or more) to unbounded ones, and replaces the Perl-style character classes with explicit enumerations (the latter might be completely irrelevant for TRegex; it's useful for rust-regex because its Perl-style character classes are full Unicode and thus a large amount of state).
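For instance, on a couple of made-up patterns in the style of the dataset:

fa_simplifier(r"[\w.]{1,512}")  # -> '[A-Za-z0-9_.]+'
fa_simplifier(r"\d{0,1000}")    # -> '[0-9]*'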