re is much slower than on CPython
I've been adding GraalPy support to a classifier-type project that is naively based on applying a bunch of regexes to an input. While GraalPy works, applying the regexes is quite slow: about 4x slower than on CPython, while using 4 times the CPU.
Here's a repro script and accompanying data (basically a cut-down version of the naive classifier implementation): script.zip
timings:
> python3.12 --version
Python 3.12.6
> time python3.12 run.py
75158 lines in 11.7s
156.1 us/line
python3.12 run.py 11.77s user 0.05s system 97% cpu 12.172 total
> graalpy --version
GraalPy 3.11.7 (Oracle GraalVM Native 24.1.0)
> time graalpy run.py
75158 lines in 48.2s
640.7 us/line
graalpy run.py 192.00s user 1.00s system 394% cpu 48.976 total
This is on a 10-core M1 Pro. Using cpusampler, I confirmed that essentially all the "user" time is in _sre:
_search || 43580ms 98.6% || 43580ms 98.6% || <frozen graalpy._sre>~1:0
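(cpusampler being GraalVM's built-in sampling profiler, enabled here with something along the lines of graalpy --cpusampler run.py.)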
Regexes are their own compilation unit; their DFAs are JIT-compiled and optimized separately, just like Python functions. You have 628 regexes, which take a long time to compile. I got a better time (still worse than CPython) with JIT compilation for regexes turned off (--experimental-options --engine.CompileOnly='~tregex re'), so it seems that the JIT compilation is too slow for the regexes to pay off. I'll ask our regex experts if we can do something about it.
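Spelled out against the repro script, that invocation would be:
> graalpy --experimental-options --engine.CompileOnly='~tregex re' run.py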
So I talked to one of the devs of our regex engine. The main problem is that your regexes use a lot of x{n,m} patterns with large m. Those currently don't allow generating a reasonable DFA, so the regex engine has to use a slower fallback execution model. Someone is currently working on supporting these patterns in the faster execution model, so this should get better in the next release.
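To give an intuition for why a large m is a problem for an automaton-based engine, here is a toy desugaring for illustration (not how the engine is actually implemented internally):

def desugar_bounded(atom: str, n: int, m: int) -> str:
    # Expand atom{n,m} into n mandatory copies followed by (m - n)
    # nested optional copies, the classic bounded-repetition expansion.
    optional = ""
    for _ in range(m - n):
        optional = f"(?:{atom}{optional})?"
    return atom * n + optional

print(desugar_bounded("a", 2, 4))  # prints aa(?:a(?:a)?)?
# The expanded pattern, and the automaton built from it, grows linearly
# with m, so a repetition like a{1,1000} alone contributes on the order
# of a thousand states, and several such repetitions compound the problem.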
Ah I see, I didn't know Graal used a DFA internally, that explains things. I've had the same issue in Rust, as regex is also automata-based, and the way the regexes are written in the core data file also caused trouble there (mostly memory; runtime did suffer a bit, but not to the same extent).
I improved things by transforming the bounded repetitions back into unbounded ones before compilation; I'll see if I can do the same for Graal.
Although, from my understanding, the source dataset used bounded repetitions to limit the risk of catastrophic backtracking in backtracking regex engines (like CPython's own). Is there a flag exposed somewhere that indicates whether a regex uses backtracking or finite automata, so I can make sure I only perform the rewriting when a DFA is in use?
Currently, no. Our regex engine seems to have a property for that, but we currently don't expose it on the Python Pattern object. (_sre.tregex_compile(pattern, _sre._METHOD_SEARCH, False).isBacktracking works, but it's a total hack that might break any time)
OK, I'll go with an implementation check then, at least for the time being (assuming the rewriting plan pans out).
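Roughly along these lines (a sketch; it assumes sys.implementation.name reports "graalpy" and leans on the unstable tregex_compile hack above, so any part of it may break):

import sys

def uses_backtracking(pattern: str) -> bool | None:
    # Best-effort probe: True/False on GraalPy, None when unknown.
    if sys.implementation.name != "graalpy":
        return None  # CPython's re always backtracks, nothing to detect there
    try:
        import _sre  # GraalPy internals, explicitly not a stable API
        tregex = _sre.tregex_compile(pattern, _sre._METHOD_SEARCH, False)
        return bool(tregex.isBacktracking)
    except Exception:
        return None

def maybe_simplify(pattern: str) -> str:
    # Only rewrite when the engine is positively known to be automata-based.
    if uses_backtracking(pattern) is False:
        return fa_simplifier(pattern)  # the rewriting routine, shown below
    return pattern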
@msimacek With a "simplifier" added to the script, the timings do improve by about 25% on this machine, but that's still quite far behind CPython.
> graalpy run.py -c
75158 lines in 35.8s
476.8 us/line
graalpy run.py -c 142.91s user 0.93s system 391% cpu 36.729 total
The very minor improvement in runtime you noticed with regex JIT compilation turned off does seem to hold here (I get about 1 second less with the options you provided), though the main benefit is CPU consumption dropping from ~400% to ~100%. Somewhat oddly, it doesn't seem to change memory consumption much, if at all, even though that was by far the biggest effect on both rust-regex and RE2.
Here's the simplification routine I implemented:
import re

REPETITION_PATTERN = re.compile(r"\{(0|1)\s*,\s*\d{3,}\}")
CLASS_PATTERN = re.compile(
    r"""
    \[[^]]*\\(d|w)[^]]*\]
    |
    \\(d|w)
    """,
    re.VERBOSE,
)

def class_replacer(m: re.Match[str]) -> str:
    # group 1 set = escape found inside a class, so splice the range in bare;
    # otherwise it's a standalone escape and gets wrapped in its own class
    d, w = ("0-9", "A-Za-z0-9_") if m[1] else ("[0-9]", "[A-Za-z0-9_]")
    return m[0].replace(r"\d", d).replace(r"\w", w)

def fa_simplifier(pattern: str) -> str:
    # {0,NNN} -> * and {1,NNN} -> + whenever NNN has 3 or more digits
    pattern = REPETITION_PATTERN.sub(lambda m: "*" if m[1] == "0" else "+", pattern)
    return CLASS_PATTERN.sub(class_replacer, pattern)
It basically converts bounded repetitions whose upper bound has at least 3 digits (i.e. 100 or more) to unbounded ones, and replaces the Perl-style character classes with explicit enumerations (the latter might be completely irrelevant for TRegex; it's useful for rust-regex because its Perl-style character classes are full Unicode and thus a large amount of state).
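For instance, on a couple of made-up patterns in the style of the dataset:

fa_simplifier(r"[\w.]{1,512}")  # -> '[A-Za-z0-9_.]+'
fa_simplifier(r"\d{0,1000}")    # -> '[0-9]*'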