Migrate RegExp to regonaut engine
This PR replaces both https://pkg.go.dev/regexp and https://github.com/dlclark/regexp2 usages with https://github.com/auvred/regonaut (an ES2025-compatible RegExp engine).
I've tried to keep diff minimal for easier review. For now, this PR only replaces the RegExp engine and enables some TC39 tests - no new features are introduced yet.
If this PR is merged, I plan to submit a few follow-up PRs to add support for:
- Unicode sets (
vflag) - Named groups
- Match Indices (
dflag)
This looks very promising. When I looked into named groups I realised regexp2 was not the right fit as it's a port of the .NET library which has a very limited ECMAScript compatibility, and in order to make it fully compatible some fundamental changes are needed. I even considered forking it at some point but then I realised I didn't have enough time to make those changes and maintain the code.
I've had a quick look and my main concern so far is performance. Even running the very limited set of benchmarks from goja shows a significant degradation in both time and memory:
│ regexp_before.txt │ regexp_after.txt │
│ sec/op │ sec/op vs base │
RegexpSplitWithBackRef-16 3.243µ ± 1% 24.268µ ± 1% +648.45% (p=0.000 n=10)
RegexpMatch-16 44.84µ ± 1% 280.40µ ± 1% +525.40% (p=0.000 n=10)
RegexpMatchCache-16 1.417m ± 1% 4.315m ± 1% +204.40% (p=0.000 n=10)
RegexpMatchAll-16 1.968m ± 1% 5.059m ± 1% +157.05% (p=0.000 n=10)
RegexpSingleExec/Re-ASCII-16 751.8n ± 1% 2196.0n ± 2% +192.10% (p=0.000 n=10)
RegexpSingleExec/Re2-ASCII-16 1.104µ ± 1% 2.424µ ± 2% +119.52% (p=0.000 n=10)
RegexpSingleExec/Re-Unicode-16 1.383µ ± 1% 2.115µ ± 1% +52.93% (p=0.000 n=10)
RegexpSingleExec/Re2-Unicode-16 1.141µ ± 2% 2.365µ ± 1% +107.27% (p=0.000 n=10)
geomean 12.32µ 37.55µ +204.77%
│ regexp_before.txt │ regexp_after.txt │
│ B/op │ B/op vs base │
RegexpMatchCache-16 1.336Mi ± 0% 8.958Mi ± 0% +570.49% (p=0.000 n=10)
RegexpMatchAll-16 1.991Mi ± 0% 9.611Mi ± 0% +382.67% (p=0.000 n=10)
RegexpSingleExec/Re-ASCII-16 759.0 ± 0% 1384.0 ± 0% +82.35% (p=0.000 n=10)
RegexpSingleExec/Re2-ASCII-16 1.266Ki ± 0% 1.398Ki ± 0% +10.49% (p=0.000 n=10)
RegexpSingleExec/Re-Unicode-16 793.0 ± 0% 1304.0 ± 0% +64.44% (p=0.000 n=10)
RegexpSingleExec/Re2-Unicode-16 1.273Ki ± 0% 1.320Ki ± 0% +3.68% (p=0.000 n=10)
geomean 11.71Ki 25.68Ki +119.28%
│ regexp_before.txt │ regexp_after.txt │
│ allocs/op │ allocs/op vs base │
RegexpMatchCache-16 22.34k ± 0% 33.37k ± 0% +49.39% (p=0.000 n=10)
RegexpMatchAll-16 31.86k ± 0% 42.91k ± 0% +34.66% (p=0.000 n=10)
RegexpSingleExec/Re-ASCII-16 11.00 ± 0% 22.00 ± 0% +100.00% (p=0.000 n=10)
RegexpSingleExec/Re2-ASCII-16 18.00 ± 0% 24.00 ± 0% +33.33% (p=0.000 n=10)
RegexpSingleExec/Re-Unicode-16 14.00 ± 0% 21.00 ± 0% +50.00% (p=0.000 n=10)
RegexpSingleExec/Re2-Unicode-16 20.00 ± 0% 23.00 ± 0% +15.00% (p=0.000 n=10)
geomean 184.5 267.4 +44.89%
Do you plan to do any performance improvements? The thing is most people are not even aware of the ECMAScript regular expression quirks, but they would notice if their ^[a-z]+$ suddenly ran slower...
Do you plan to do any performance improvements? The thing is most people are not even aware of the ECMAScript regular expression quirks, but they would notice if their
^[a-z]+$suddenly ran slower...
I've been planning to implement a second finite automata engine (alongside the existing backtracking engine) to improve performance. Unfortunately, implementing this new engine is quite time-consuming, and I'm currently limited on time. I'll make this PR a draft for now and will come back to it once the finite automata engine is finished.
Thanks. I'll add a couple of comments on the PR, as they would still apply...