goja icon indicating copy to clipboard operation
goja copied to clipboard

Migrate RegExp to regonaut engine

Open auvred opened this issue 3 months ago • 3 comments

This PR replaces both https://pkg.go.dev/regexp and https://github.com/dlclark/regexp2 usages with https://github.com/auvred/regonaut (an ES2025-compatible RegExp engine).

I've tried to keep diff minimal for easier review. For now, this PR only replaces the RegExp engine and enables some TC39 tests - no new features are introduced yet.

If this PR is merged, I plan to submit a few follow-up PRs to add support for:

  • Unicode sets (v flag)
  • Named groups
  • Match Indices (d flag)

auvred avatar Sep 13 '25 08:09 auvred

This looks very promising. When I looked into named groups I realised regexp2 was not the right fit as it's a port of the .NET library which has a very limited ECMAScript compatibility, and in order to make it fully compatible some fundamental changes are needed. I even considered forking it at some point but then I realised I didn't have enough time to make those changes and maintain the code.

I've had a quick look and my main concern so far is performance. Even running the very limited set of benchmarks from goja shows a significant degradation in both time and memory:

                                │ regexp_before.txt │           regexp_after.txt            │
                                │      sec/op       │    sec/op     vs base                 │
RegexpSplitWithBackRef-16               3.243µ ± 1%   24.268µ ± 1%  +648.45% (p=0.000 n=10)
RegexpMatch-16                          44.84µ ± 1%   280.40µ ± 1%  +525.40% (p=0.000 n=10)
RegexpMatchCache-16                     1.417m ± 1%    4.315m ± 1%  +204.40% (p=0.000 n=10)
RegexpMatchAll-16                       1.968m ± 1%    5.059m ± 1%  +157.05% (p=0.000 n=10)
RegexpSingleExec/Re-ASCII-16            751.8n ± 1%   2196.0n ± 2%  +192.10% (p=0.000 n=10)
RegexpSingleExec/Re2-ASCII-16           1.104µ ± 1%    2.424µ ± 2%  +119.52% (p=0.000 n=10)
RegexpSingleExec/Re-Unicode-16          1.383µ ± 1%    2.115µ ± 1%   +52.93% (p=0.000 n=10)
RegexpSingleExec/Re2-Unicode-16         1.141µ ± 2%    2.365µ ± 1%  +107.27% (p=0.000 n=10)
geomean                                 12.32µ         37.55µ       +204.77%

                                │ regexp_before.txt │           regexp_after.txt            │
                                │       B/op        │     B/op      vs base                 │
RegexpMatchCache-16                    1.336Mi ± 0%   8.958Mi ± 0%  +570.49% (p=0.000 n=10)
RegexpMatchAll-16                      1.991Mi ± 0%   9.611Mi ± 0%  +382.67% (p=0.000 n=10)
RegexpSingleExec/Re-ASCII-16             759.0 ± 0%    1384.0 ± 0%   +82.35% (p=0.000 n=10)
RegexpSingleExec/Re2-ASCII-16          1.266Ki ± 0%   1.398Ki ± 0%   +10.49% (p=0.000 n=10)
RegexpSingleExec/Re-Unicode-16           793.0 ± 0%    1304.0 ± 0%   +64.44% (p=0.000 n=10)
RegexpSingleExec/Re2-Unicode-16        1.273Ki ± 0%   1.320Ki ± 0%    +3.68% (p=0.000 n=10)
geomean                                11.71Ki        25.68Ki       +119.28%

                                │ regexp_before.txt │           regexp_after.txt           │
                                │     allocs/op     │  allocs/op   vs base                 │
RegexpMatchCache-16                     22.34k ± 0%   33.37k ± 0%   +49.39% (p=0.000 n=10)
RegexpMatchAll-16                       31.86k ± 0%   42.91k ± 0%   +34.66% (p=0.000 n=10)
RegexpSingleExec/Re-ASCII-16             11.00 ± 0%    22.00 ± 0%  +100.00% (p=0.000 n=10)
RegexpSingleExec/Re2-ASCII-16            18.00 ± 0%    24.00 ± 0%   +33.33% (p=0.000 n=10)
RegexpSingleExec/Re-Unicode-16           14.00 ± 0%    21.00 ± 0%   +50.00% (p=0.000 n=10)
RegexpSingleExec/Re2-Unicode-16          20.00 ± 0%    23.00 ± 0%   +15.00% (p=0.000 n=10)
geomean                                  184.5         267.4        +44.89%

Do you plan to do any performance improvements? The thing is most people are not even aware of the ECMAScript regular expression quirks, but they would notice if their ^[a-z]+$ suddenly ran slower...

dop251 avatar Sep 14 '25 15:09 dop251

Do you plan to do any performance improvements? The thing is most people are not even aware of the ECMAScript regular expression quirks, but they would notice if their ^[a-z]+$ suddenly ran slower...

I've been planning to implement a second finite automata engine (alongside the existing backtracking engine) to improve performance. Unfortunately, implementing this new engine is quite time-consuming, and I'm currently limited on time. I'll make this PR a draft for now and will come back to it once the finite automata engine is finished.

auvred avatar Sep 16 '25 11:09 auvred

Thanks. I'll add a couple of comments on the PR, as they would still apply...

dop251 avatar Sep 24 '25 20:09 dop251