better-cpp-syntax
better-cpp-syntax copied to clipboard
Catastrophic backtracking in `function_definition` begin regex
Coming via https://github.com/microsoft/vscode/issues/117264
The regex for function_definition takes >1s to evaluate at each step. This makes VS Code freeze for some minutes until it eventually recovers. If the regex would stop using \G, then we would cache its search results and even if it would still be slow, at least the cost would be 1s in total for the entire line. As it is right now, the usage of \G prohibits us from caching the search results, which means we need to re-search at each necessary offset (at each offset where tokenization advances).
the regex
(?:(?:^|\G|(?<=;|\}))|(?<=>))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?:((?<!\w)template(?!\w))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))?((?:(?:(?:\[\[.*?\]\]|__attribute(?:__)?\s*\(\s*\(.*?\)\s*\))|__declspec\(.*?\))|alignas\(.*?\))(?!\)))?((?:((?<!\w)(?:(?:(?:constexpr)|(?:explicit)|(?:mutable)|(?:virtual)|(?:inline)|(?:friend))|(?:(?:volatile)|(?:register)|(?:restrict)|(?:static)|(?:extern)|(?:const)))(?!\w))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))*)(\s*+((?:(?:(?:\[\[.*?\]\]|__attribute(?:__)?\s*\(\s*\(.*?\)\s*\))|__declspec\(.*?\))|alignas\(.*?\))(?!\)))?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?:(?:(?:(?:unsigned)|(?:signed)|(?:short)|(?:long))|(?:(?:struct)|(?:class)|(?:union)|(?:enum)))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))*((?:::)?(?:(?!\b(?:__has_cpp_attribute|reinterpret_cast|atomic_noexcept|atomic_commit|atomic_cancel|__has_include|synchronized|dynamic_cast|thread_local|static_cast|const_cast|co_return|constexpr|constexpr|constexpr|co_return|protected|namespace|consteval|noexcept|decltype|template|operator|noexcept|co_yield|co_await|reflexpr|continue|co_await|co_yield|requires|volatile|register|restrict|explicit|volatile|noexcept|typename|default|_Pragma|mutable|include|concept|alignas|virtual|alignof|__asm__|defined|mutable|typedef|warning|private|and_eq|define|pragma|typeid|switch|bitand|return|ifndef|export|struct|sizeof|module|static|public|extern|inline|friend|delete|xor_eq|import|not_eq|class|compl|bitor|throw|or_eq|while|catch|break|union|const|const|endif|ifdef|undef|error|using|else|line|goto|else|elif|this|enum|case|new|asm|not|try|for|and|xor|or|if|do|if)\b)(?<!\w)(?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*(?!\w)\s*+(((?<!<)<(?!<)(?:(?:(?:[^'"<>]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)(?:\s)*+)?::)*+)?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?!(?:(?:transaction_safe_dynamic)|(?:__has_cpp_attribute)|(?:reinterpret_cast)|(?:transaction_safe)|(?:atomic_noexcept)|(?:atomic_commit)|(?:atomic_cancel)|(?:__has_include)|(?:dynamic_cast)|(?:synchronized)|(?:thread_local)|(?:static_cast)|(?:const_cast)|(?:constexpr)|(?:consteval)|(?:co_return)|(?:co_return)|(?:constexpr)|(?:protected)|(?:constexpr)|(?:namespace)|(?:noexcept)|(?:typename)|(?:decltype)|(?:template)|(?:operator)|(?:noexcept)|(?:co_yield)|(?:co_await)|(?:continue)|(?:co_await)|(?:co_yield)|(?:volatile)|(?:register)|(?:restrict)|(?:explicit)|(?:override)|(?:volatile)|(?:reflexpr)|(?:noexcept)|(?:requires)|(?:alignas)|(?:typedef)|(?:nullptr)|(?:alignof)|(?:mutable)|(?:concept)|(?:virtual)|(?:defined)|(?:__asm__)|(?:include)|(?:_Pragma)|(?:mutable)|(?:default)|(?:warning)|(?:private)|(?:module)|(?:return)|(?:not_eq)|(?:xor_eq)|(?:and_eq)|(?:ifndef)|(?:pragma)|(?:export)|(?:import)|(?:sizeof)|(?:static)|(?:delete)|(?:public)|(?:define)|(?:extern)|(?:inline)|(?:typeid)|(?:switch)|(?:friend)|(?:bitand)|(?:false)|(?:compl)|(?:bitor)|(?:throw)|(?:or_eq)|(?:while)|(?:catch)|(?:break)|(?:const)|(?:final)|(?:const)|(?:endif)|(?:ifdef)|(?:undef)|(?:error)|(?:using)|(?:audit)|(?:axiom)|(?:line)|(?:else)|(?:elif)|(?:true)|(?:NULL)|(?:case)|(?:goto)|(?:else)|(?:this)|(?:new)|(?:asm)|(?:not)|(?:and)|(?:xor)|(?:try)|(?:for)|(?:if)|(?:do)|(?:or)|(?:if))\b)(?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*\b((?<!<)<(?!<)(?:(?:(?:[^'"<>]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)?(?![\w<:.]))(((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))?(?:(?:&|(?:\*))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))*(?:&|(?:\*)))?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))((?:__cdecl|__clrcall|__stdcall|__fastcall|__thiscall|__vectorcall)?)((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))((::)?(?:(?!\b(?:__has_cpp_attribute|reinterpret_cast|atomic_noexcept|atomic_commit|atomic_cancel|__has_include|synchronized|dynamic_cast|thread_local|static_cast|const_cast|co_return|constexpr|constexpr|constexpr|co_return|protected|namespace|consteval|noexcept|decltype|template|operator|noexcept|co_yield|co_await|reflexpr|continue|co_await|co_yield|requires|volatile|register|restrict|explicit|volatile|noexcept|typename|default|_Pragma|mutable|include|concept|alignas|virtual|alignof|__asm__|defined|mutable|typedef|warning|private|and_eq|define|pragma|typeid|switch|bitand|return|ifndef|export|struct|sizeof|module|static|public|extern|inline|friend|delete|xor_eq|import|not_eq|class|compl|bitor|throw|or_eq|while|catch|break|union|const|const|endif|ifdef|undef|error|using|else|line|goto|else|elif|this|enum|case|new|asm|not|try|for|and|xor|or|if|do|if)\b)(?<!\w)(?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*(?!\w)\s*+(((?<!<)<(?!<)(?:(?:(?:[^'"<>]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)(?:\s)*+)?::)*\s*+)((?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*)\b(?<!\Wreinterpret_cast|^reinterpret_cast|\Watomic_noexcept|^atomic_noexcept|\Wuint_least16_t|^uint_least16_t|\Wuint_least32_t|^uint_least32_t|\Wuint_least64_t|^uint_least64_t|\Wuint_fast16_t|^uint_fast16_t|\Wint_least16_t|^int_least16_t|\Watomic_commit|^atomic_commit|\Watomic_cancel|^atomic_cancel|\Wuint_fast64_t|^uint_fast64_t|\Wuint_least8_t|^uint_least8_t|\Wint_least64_t|^int_least64_t|\Wint_least32_t|^int_least32_t|\Wuint_fast32_t|^uint_fast32_t|\Wdynamic_cast|^dynamic_cast|\Wthread_local|^thread_local|\Wuint_fast8_t|^uint_fast8_t|\Wint_fast64_t|^int_fast64_t|\Wint_fast32_t|^int_fast32_t|\Wint_fast16_t|^int_fast16_t|\Wsynchronized|^synchronized|\Wint_least8_t|^int_least8_t|\Wsuseconds_t|^suseconds_t|\Wint_fast8_t|^int_fast8_t|\Wstatic_cast|^static_cast|\Wconst_cast|^const_cast|\Wuseconds_t|^useconds_t|\Wnamespace|^namespace|\Wco_return|^co_return|\Wblksize_t|^blksize_t|\Win_addr_t|^in_addr_t|\Win_port_t|^in_port_t|\Wuintptr_t|^uintptr_t|\Wuintmax_t|^uintmax_t|\Wuintmax_t|^uintmax_t|\Wuintmax_t|^uintmax_t|\Wconstexpr|^constexpr|\Wconstexpr|^constexpr|\Wconstexpr|^constexpr|\Wconsteval|^consteval|\Wprotected|^protected|\Wco_return|^co_return|\Wco_await|^co_await|\Wnoexcept|^noexcept|\Wrestrict|^restrict|\Wnoexcept|^noexcept|\Wdecltype|^decltype|\Wintmax_t|^intmax_t|\Wuint64_t|^uint64_t|\Wintmax_t|^intmax_t|\Wcontinue|^continue|\Wreflexpr|^reflexpr|\Wintptr_t|^intptr_t|\Wuint32_t|^uint32_t|\Wuint16_t|^uint16_t|\Wexplicit|^explicit|\Wtypename|^typename|\Wu_quad_t|^u_quad_t|\Wvolatile|^volatile|\Wtemplate|^template|\Wnoexcept|^noexcept|\Wco_yield|^co_yield|\Wco_await|^co_await|\Wvolatile|^volatile|\Woperator|^operator|\Wunsigned|^unsigned|\Wregister|^register|\Wblkcnt_t|^blkcnt_t|\Wrequires|^requires|\Wco_yield|^co_yield|\Wnullptr|^nullptr|\Wqaddr_t|^qaddr_t|\Wcaddr_t|^caddr_t|\Wmutable|^mutable|\Wvirtual|^virtual|\Wmutable|^mutable|\Wdaddr_t|^daddr_t|\Wfixpt_t|^fixpt_t|\Wconcept|^concept|\Wnlink_t|^nlink_t|\Wdefault|^default|\Wwchar_t|^wchar_t|\Wsegsz_t|^segsz_t|\Wswblk_t|^swblk_t|\Wclock_t|^clock_t|\Wssize_t|^ssize_t|\W__asm__|^__asm__|\Wint16_t|^int16_t|\Wint32_t|^int32_t|\Wint64_t|^int64_t|\Wuint8_t|^uint8_t|\Wu_short|^u_short|\Walignas|^alignas|\Walignof|^alignof|\Wtypedef|^typedef|\Wprivate|^private|\Wu_char|^u_char|\Wmode_t|^mode_t|\Wstatic|^static|\Wdouble|^double|\Wnot_eq|^not_eq|\Wtypeid|^typeid|\Wmodule|^module|\Wstruct|^struct|\Wexport|^export|\Wxor_eq|^xor_eq|\Wand_eq|^and_eq|\Wu_long|^u_long|\Wquad_t|^quad_t|\Wsigned|^signed|\Wushort|^ushort|\Wimport|^import|\Wbitand|^bitand|\Wfriend|^friend|\Wtime_t|^time_t|\Wdelete|^delete|\Wsize_t|^size_t|\Wint8_t|^int8_t|\Winline|^inline|\Wextern|^extern|\Wpublic|^public|\Wsizeof|^sizeof|\Wswitch|^switch|\Wreturn|^return|\Wconst|^const|\Wshort|^short|\Wfloat|^float|\Wu_int|^u_int|\Wdiv_t|^div_t|\Wdev_t|^dev_t|\Wgid_t|^gid_t|\Wino_t|^ino_t|\Wkey_t|^key_t|\Wpid_t|^pid_t|\Woff_t|^off_t|\Wuid_t|^uid_t|\Wwhile|^while|\Wor_eq|^or_eq|\Wthrow|^throw|\Wbitor|^bitor|\Wfalse|^false|\Wclass|^class|\Wunion|^union|\Wconst|^const|\Wcompl|^compl|\Wusing|^using|\Wcatch|^catch|\Wbreak|^break|\Wtrue|^true|\Wid_t|^id_t|\Wchar|^char|\Wid_t|^id_t|\Wauto|^auto|\Wcase|^case|\Wuint|^uint|\Wbool|^bool|\Wlong|^long|\Wvoid|^void|\Wenum|^enum|\WNULL|^NULL|\Wthis|^this|\Welse|^else|\Wgoto|^goto|\Wnew|^new|\Wtry|^try|\Wxor|^xor|\Wnot|^not|\Wint|^int|\Wand|^and|\Wfor|^for|\Wasm|^asm|\Wdo|^do|\Wor|^or|\Wif|^if)((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?=\()
the test string
typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t;
Here I have tested the regex at regex101.com and it leads to catastrophic backtracking:

I'll test removing \G when I get a chance, although I'm pretty sure it will strongly effect correctness since it's not exactly a pattern that is just inserted casually.
I tested removing it, and substituting it, and doing either breaks 80% of all tests.
Can confirm function_definition causing the issue
Tho the lag seems to be constituted around variable declaration rather than the struct code

int var;
int var; int var;
int var; int var; int var; int var;
int var; int var; int var; int var; int var; int var; int var; int var;
int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;
int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;
int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;
int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;
There is a double up of the comment handling group
Disabling it seems to improve performance by 2x
but does not fix it entirely

I'm curious
is (?>(?:\\s)+) not worse than \\s++?
performance wise?
I too also had performance issues with (^|\\G) (in my own extension)
https://github.com/RedCMD/TmLanguage-Syntax-Highlighter/commit/2b02db24c6c74d653bd99f8593ea235914e96bb6
I found splitting the rule into two different ones "match": "^..." and "match": "\\G..." was enough to fix my issue
In some very specfic cases it was soo bad, that previously a 280,000 long line that could be parsed correctly, was reduced down to only 5000 characters being parsed correctly.
a line from this repo actually https://github.com/textmate/c.tmbundle/blob/80c8e886b67227096a56aca6b92fe1587f76d603/Syntaxes/Platform.tmLanguage#L299

@jeff-hykin will the double up of the comment handling capture group be removed?
I have found it to be causing 50% of the lag
((?:(?:(?:(?>(?:\\s)+)|(\\/\\*)((?:[^\\*]|(?:\\*)++[^\\/])*+((?:\\*)++\\/)))+)|(?:\\b)|(?=\\W)|(?<=\\W)|(?:\\A)|(?:\\Z)))
https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L4144
Hmm @RedCMD I don't see that pattern snippet in the file you linked.
is (?>(?:\s)+) not worse than \s++?
Yes I think its worse, and the grammar generator should automatically convert (?>(?:\\s)+) to \\s++(along with other simplifcations like removing redundancy within capture groups).
However, I did notice last month I was using the original version of the grammar generator (which didn't have the optimization) so I updated it and regerated the grammar. So the last handful of releases shouldn't contain any (?>(?:\\s)+) patterns
sorry. the exact snippet has changed a bit
but it is still there
((?:(?:(?:\\s*+(\\/\\*)((?:[^\\*]++|\\*+(?!\\/))*+(\\*\\/))\\s*+)+)|(?:\\s++)|(?<=\\W)|(?=\\W)|^|(?:\\n?$)|(?:\\A)|(?:\\Z)))
if you paste that twice in ctrl+f find, you will see it in destructor_inline, function_definition and operator_overload
https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L2823-L2824
https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L4143-L4144
https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L7090-L7091
notice the double up of the 4 sets of captures

ahhhhh..... @jeff-hykin
I don't see that pattern snippet in the file you linked.
I see the problem now
I've been confusing https://github.com/jeff-hykin/better-cpp-syntax/tree/master/syntaxes with https://github.com/jeff-hykin/better-cpp-syntax/tree/master/autogenerated
syntaxes one is very much outdated
tho the issue is still present in the newer one
syntaxes one is very much outdated
Yeah, I should ask alexr if I can go ahead and delete syntaxes/. I'm pretty sure I can but I didn't want to potentially cause any breakages.
tho the issue is still present in the newer one
Alright @RedCMD , I think I found the source of the repeated pattern problem. Some composable patterns had std_space at the end, and others had it at the begining, so when they were used next to eachother they doubled up. I got rid of that, so take a look at the autogenerated/cpp.tmLanguage.json in this commit. Hopefully that change squashed all of the repeats.
still one left in operator_overload
https://github.com/jeff-hykin/better-cpp-syntax/blob/190c42bd2b0ae514be32525e4b67122b52c1bb30/autogenerated/cpp.tmLanguage.json#L3595-L3596

the generator really likes putting (?: ... ) groups around everything
the generator really likes putting (?: ... ) groups around everything
Yeah, its cause of modularity. Code like maybe( *any possible value* ) has to be generated first as (?: *any possible value* )? until its proven that unwrapping it to be *any possible value*? would behave equivlently. Right now its not doing full regex parsing, so theres a lot of stuff that could be unwrapped that isn't. For example, I don't think character classes are unwrapped so (?:[abc])? just stays as-is since it can't prove that unwrapping is safe.
We have the concept of "single_entity", which a and (?:abc) are a single entity, but abc is not
And that determines if it needs to be wrapped or not. For example
If this funciton were more intelligent, it would get rid of any unnecessary wrappings: https://github.com/jeff-hykin/ruby_grammar_builder/blob/eb206ce1b1fbde4ce8013c67e262fdc2e7c8d2a1/main/lib/ruby_grammar_builder/util.rb#L53
I tried looking around for a generic regex minimizer/optimizer/uglifier a few years ago but didn't see anything of the sort.
Actually, looking at the source there were some obvious mistakes. I updated the library, and now there's 17,588 fewer chars in the generated grammar, and all the test output code is identical.
I know for a fact we optimized (?:a+)? to become a* but it looks like that also is not always happening
I can confirm the lag has been greatly reduced it no longer freezes VSCode, now only taking 1-2sec to tokenize
and with it being near instant for int var;