better-cpp-syntax icon indicating copy to clipboard operation
better-cpp-syntax copied to clipboard

Catastrophic backtracking in `function_definition` begin regex

Open alexdima opened this issue 4 years ago • 12 comments

Coming via https://github.com/microsoft/vscode/issues/117264

The regex for function_definition takes >1s to evaluate at each step. This makes VS Code freeze for some minutes until it eventually recovers. If the regex would stop using \G, then we would cache its search results and even if it would still be slow, at least the cost would be 1s in total for the entire line. As it is right now, the usage of \G prohibits us from caching the search results, which means we need to re-search at each necessary offset (at each offset where tokenization advances).

the regex
(?:(?:^|\G|(?<=;|\}))|(?<=>))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?:((?<!\w)template(?!\w))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))?((?:(?:(?:\[\[.*?\]\]|__attribute(?:__)?\s*\(\s*\(.*?\)\s*\))|__declspec\(.*?\))|alignas\(.*?\))(?!\)))?((?:((?<!\w)(?:(?:(?:constexpr)|(?:explicit)|(?:mutable)|(?:virtual)|(?:inline)|(?:friend))|(?:(?:volatile)|(?:register)|(?:restrict)|(?:static)|(?:extern)|(?:const)))(?!\w))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))*)(\s*+((?:(?:(?:\[\[.*?\]\]|__attribute(?:__)?\s*\(\s*\(.*?\)\s*\))|__declspec\(.*?\))|alignas\(.*?\))(?!\)))?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?:(?:(?:(?:unsigned)|(?:signed)|(?:short)|(?:long))|(?:(?:struct)|(?:class)|(?:union)|(?:enum)))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))*((?:::)?(?:(?!\b(?:__has_cpp_attribute|reinterpret_cast|atomic_noexcept|atomic_commit|atomic_cancel|__has_include|synchronized|dynamic_cast|thread_local|static_cast|const_cast|co_return|constexpr|constexpr|constexpr|co_return|protected|namespace|consteval|noexcept|decltype|template|operator|noexcept|co_yield|co_await|reflexpr|continue|co_await|co_yield|requires|volatile|register|restrict|explicit|volatile|noexcept|typename|default|_Pragma|mutable|include|concept|alignas|virtual|alignof|__asm__|defined|mutable|typedef|warning|private|and_eq|define|pragma|typeid|switch|bitand|return|ifndef|export|struct|sizeof|module|static|public|extern|inline|friend|delete|xor_eq|import|not_eq|class|compl|bitor|throw|or_eq|while|catch|break|union|const|const|endif|ifdef|undef|error|using|else|line|goto|else|elif|this|enum|case|new|asm|not|try|for|and|xor|or|if|do|if)\b)(?<!\w)(?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*(?!\w)\s*+(((?<!<)<(?!<)(?:(?:(?:[^'"<>]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)(?:\s)*+)?::)*+)?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?!(?:(?:transaction_safe_dynamic)|(?:__has_cpp_attribute)|(?:reinterpret_cast)|(?:transaction_safe)|(?:atomic_noexcept)|(?:atomic_commit)|(?:atomic_cancel)|(?:__has_include)|(?:dynamic_cast)|(?:synchronized)|(?:thread_local)|(?:static_cast)|(?:const_cast)|(?:constexpr)|(?:consteval)|(?:co_return)|(?:co_return)|(?:constexpr)|(?:protected)|(?:constexpr)|(?:namespace)|(?:noexcept)|(?:typename)|(?:decltype)|(?:template)|(?:operator)|(?:noexcept)|(?:co_yield)|(?:co_await)|(?:continue)|(?:co_await)|(?:co_yield)|(?:volatile)|(?:register)|(?:restrict)|(?:explicit)|(?:override)|(?:volatile)|(?:reflexpr)|(?:noexcept)|(?:requires)|(?:alignas)|(?:typedef)|(?:nullptr)|(?:alignof)|(?:mutable)|(?:concept)|(?:virtual)|(?:defined)|(?:__asm__)|(?:include)|(?:_Pragma)|(?:mutable)|(?:default)|(?:warning)|(?:private)|(?:module)|(?:return)|(?:not_eq)|(?:xor_eq)|(?:and_eq)|(?:ifndef)|(?:pragma)|(?:export)|(?:import)|(?:sizeof)|(?:static)|(?:delete)|(?:public)|(?:define)|(?:extern)|(?:inline)|(?:typeid)|(?:switch)|(?:friend)|(?:bitand)|(?:false)|(?:compl)|(?:bitor)|(?:throw)|(?:or_eq)|(?:while)|(?:catch)|(?:break)|(?:const)|(?:final)|(?:const)|(?:endif)|(?:ifdef)|(?:undef)|(?:error)|(?:using)|(?:audit)|(?:axiom)|(?:line)|(?:else)|(?:elif)|(?:true)|(?:NULL)|(?:case)|(?:goto)|(?:else)|(?:this)|(?:new)|(?:asm)|(?:not)|(?:and)|(?:xor)|(?:try)|(?:for)|(?:if)|(?:do)|(?:or)|(?:if))\b)(?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*\b((?<!<)<(?!<)(?:(?:(?:[^'"<>]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)?(?![\w<:.]))(((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))?(?:(?:&|(?:\*))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z))))*(?:&|(?:\*)))?((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))((?:__cdecl|__clrcall|__stdcall|__fastcall|__thiscall|__vectorcall)?)((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))((::)?(?:(?!\b(?:__has_cpp_attribute|reinterpret_cast|atomic_noexcept|atomic_commit|atomic_cancel|__has_include|synchronized|dynamic_cast|thread_local|static_cast|const_cast|co_return|constexpr|constexpr|constexpr|co_return|protected|namespace|consteval|noexcept|decltype|template|operator|noexcept|co_yield|co_await|reflexpr|continue|co_await|co_yield|requires|volatile|register|restrict|explicit|volatile|noexcept|typename|default|_Pragma|mutable|include|concept|alignas|virtual|alignof|__asm__|defined|mutable|typedef|warning|private|and_eq|define|pragma|typeid|switch|bitand|return|ifndef|export|struct|sizeof|module|static|public|extern|inline|friend|delete|xor_eq|import|not_eq|class|compl|bitor|throw|or_eq|while|catch|break|union|const|const|endif|ifdef|undef|error|using|else|line|goto|else|elif|this|enum|case|new|asm|not|try|for|and|xor|or|if|do|if)\b)(?<!\w)(?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*(?!\w)\s*+(((?<!<)<(?!<)(?:(?:(?:[^'"<>]*+|"(?:[^"]*|\\")")|'(?:[^']*|\\')')\g<60>?)+>)(?:\s)*+)?::)*\s*+)((?:[a-zA-Z_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))(?:[a-zA-Z0-9_]|(?:\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}))*)\b(?<!\Wreinterpret_cast|^reinterpret_cast|\Watomic_noexcept|^atomic_noexcept|\Wuint_least16_t|^uint_least16_t|\Wuint_least32_t|^uint_least32_t|\Wuint_least64_t|^uint_least64_t|\Wuint_fast16_t|^uint_fast16_t|\Wint_least16_t|^int_least16_t|\Watomic_commit|^atomic_commit|\Watomic_cancel|^atomic_cancel|\Wuint_fast64_t|^uint_fast64_t|\Wuint_least8_t|^uint_least8_t|\Wint_least64_t|^int_least64_t|\Wint_least32_t|^int_least32_t|\Wuint_fast32_t|^uint_fast32_t|\Wdynamic_cast|^dynamic_cast|\Wthread_local|^thread_local|\Wuint_fast8_t|^uint_fast8_t|\Wint_fast64_t|^int_fast64_t|\Wint_fast32_t|^int_fast32_t|\Wint_fast16_t|^int_fast16_t|\Wsynchronized|^synchronized|\Wint_least8_t|^int_least8_t|\Wsuseconds_t|^suseconds_t|\Wint_fast8_t|^int_fast8_t|\Wstatic_cast|^static_cast|\Wconst_cast|^const_cast|\Wuseconds_t|^useconds_t|\Wnamespace|^namespace|\Wco_return|^co_return|\Wblksize_t|^blksize_t|\Win_addr_t|^in_addr_t|\Win_port_t|^in_port_t|\Wuintptr_t|^uintptr_t|\Wuintmax_t|^uintmax_t|\Wuintmax_t|^uintmax_t|\Wuintmax_t|^uintmax_t|\Wconstexpr|^constexpr|\Wconstexpr|^constexpr|\Wconstexpr|^constexpr|\Wconsteval|^consteval|\Wprotected|^protected|\Wco_return|^co_return|\Wco_await|^co_await|\Wnoexcept|^noexcept|\Wrestrict|^restrict|\Wnoexcept|^noexcept|\Wdecltype|^decltype|\Wintmax_t|^intmax_t|\Wuint64_t|^uint64_t|\Wintmax_t|^intmax_t|\Wcontinue|^continue|\Wreflexpr|^reflexpr|\Wintptr_t|^intptr_t|\Wuint32_t|^uint32_t|\Wuint16_t|^uint16_t|\Wexplicit|^explicit|\Wtypename|^typename|\Wu_quad_t|^u_quad_t|\Wvolatile|^volatile|\Wtemplate|^template|\Wnoexcept|^noexcept|\Wco_yield|^co_yield|\Wco_await|^co_await|\Wvolatile|^volatile|\Woperator|^operator|\Wunsigned|^unsigned|\Wregister|^register|\Wblkcnt_t|^blkcnt_t|\Wrequires|^requires|\Wco_yield|^co_yield|\Wnullptr|^nullptr|\Wqaddr_t|^qaddr_t|\Wcaddr_t|^caddr_t|\Wmutable|^mutable|\Wvirtual|^virtual|\Wmutable|^mutable|\Wdaddr_t|^daddr_t|\Wfixpt_t|^fixpt_t|\Wconcept|^concept|\Wnlink_t|^nlink_t|\Wdefault|^default|\Wwchar_t|^wchar_t|\Wsegsz_t|^segsz_t|\Wswblk_t|^swblk_t|\Wclock_t|^clock_t|\Wssize_t|^ssize_t|\W__asm__|^__asm__|\Wint16_t|^int16_t|\Wint32_t|^int32_t|\Wint64_t|^int64_t|\Wuint8_t|^uint8_t|\Wu_short|^u_short|\Walignas|^alignas|\Walignof|^alignof|\Wtypedef|^typedef|\Wprivate|^private|\Wu_char|^u_char|\Wmode_t|^mode_t|\Wstatic|^static|\Wdouble|^double|\Wnot_eq|^not_eq|\Wtypeid|^typeid|\Wmodule|^module|\Wstruct|^struct|\Wexport|^export|\Wxor_eq|^xor_eq|\Wand_eq|^and_eq|\Wu_long|^u_long|\Wquad_t|^quad_t|\Wsigned|^signed|\Wushort|^ushort|\Wimport|^import|\Wbitand|^bitand|\Wfriend|^friend|\Wtime_t|^time_t|\Wdelete|^delete|\Wsize_t|^size_t|\Wint8_t|^int8_t|\Winline|^inline|\Wextern|^extern|\Wpublic|^public|\Wsizeof|^sizeof|\Wswitch|^switch|\Wreturn|^return|\Wconst|^const|\Wshort|^short|\Wfloat|^float|\Wu_int|^u_int|\Wdiv_t|^div_t|\Wdev_t|^dev_t|\Wgid_t|^gid_t|\Wino_t|^ino_t|\Wkey_t|^key_t|\Wpid_t|^pid_t|\Woff_t|^off_t|\Wuid_t|^uid_t|\Wwhile|^while|\Wor_eq|^or_eq|\Wthrow|^throw|\Wbitor|^bitor|\Wfalse|^false|\Wclass|^class|\Wunion|^union|\Wconst|^const|\Wcompl|^compl|\Wusing|^using|\Wcatch|^catch|\Wbreak|^break|\Wtrue|^true|\Wid_t|^id_t|\Wchar|^char|\Wid_t|^id_t|\Wauto|^auto|\Wcase|^case|\Wuint|^uint|\Wbool|^bool|\Wlong|^long|\Wvoid|^void|\Wenum|^enum|\WNULL|^NULL|\Wthis|^this|\Welse|^else|\Wgoto|^goto|\Wnew|^new|\Wtry|^try|\Wxor|^xor|\Wnot|^not|\Wint|^int|\Wand|^and|\Wfor|^for|\Wasm|^asm|\Wdo|^do|\Wor|^or|\Wif|^if)((?:(?:(?:(?>(?:\s)+)|(\/\*)((?:[^\*]|(?:\*)++[^\/])*+((?:\*)++\/)))+)|(?:\b)|(?=\W)|(?<=\W)|(?:\)|(?:\Z)))(?=\()
the test string
typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t; typedef struct { long int quot; long int rem; } ldiv_t;

Here I have tested the regex at regex101.com and it leads to catastrophic backtracking: image

alexdima avatar Nov 19 '21 10:11 alexdima

I'll test removing \G when I get a chance, although I'm pretty sure it will strongly effect correctness since it's not exactly a pattern that is just inserted casually.

I tested removing it, and substituting it, and doing either breaks 80% of all tests.

jeff-hykin avatar Nov 20 '21 17:11 jeff-hykin

Can confirm function_definition causing the issue

Tho the lag seems to be constituted around variable declaration rather than the struct code image

int var;

int var; int var;

int var; int var; int var; int var;

int var; int var; int var; int var; int var; int var; int var; int var;

int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;

int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;

int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;

int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var; int var;

There is a double up of the comment handling group image Disabling it seems to improve performance by 2x but does not fix it entirely image image

I'm curious is (?>(?:\\s)+) not worse than \\s++? performance wise?

I too also had performance issues with (^|\\G) (in my own extension) https://github.com/RedCMD/TmLanguage-Syntax-Highlighter/commit/2b02db24c6c74d653bd99f8593ea235914e96bb6 I found splitting the rule into two different ones "match": "^..." and "match": "\\G..." was enough to fix my issue In some very specfic cases it was soo bad, that previously a 280,000 long line that could be parsed correctly, was reduced down to only 5000 characters being parsed correctly. a line from this repo actually https://github.com/textmate/c.tmbundle/blob/80c8e886b67227096a56aca6b92fe1587f76d603/Syntaxes/Platform.tmLanguage#L299 image

RedCMD avatar Dec 06 '22 01:12 RedCMD

@jeff-hykin will the double up of the comment handling capture group be removed? I have found it to be causing 50% of the lag ((?:(?:(?:(?>(?:\\s)+)|(\\/\\*)((?:[^\\*]|(?:\\*)++[^\\/])*+((?:\\*)++\\/)))+)|(?:\\b)|(?=\\W)|(?<=\\W)|(?:\\A)|(?:\\Z))) https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L4144

RedCMD avatar Feb 02 '23 23:02 RedCMD

Hmm @RedCMD I don't see that pattern snippet in the file you linked.

is (?>(?:\s)+) not worse than \s++?

Yes I think its worse, and the grammar generator should automatically convert (?>(?:\\s)+) to \\s++(along with other simplifcations like removing redundancy within capture groups). However, I did notice last month I was using the original version of the grammar generator (which didn't have the optimization) so I updated it and regerated the grammar. So the last handful of releases shouldn't contain any (?>(?:\\s)+) patterns

jeff-hykin avatar Feb 03 '23 19:02 jeff-hykin

sorry. the exact snippet has changed a bit but it is still there ((?:(?:(?:\\s*+(\\/\\*)((?:[^\\*]++|\\*+(?!\\/))*+(\\*\\/))\\s*+)+)|(?:\\s++)|(?<=\\W)|(?=\\W)|^|(?:\\n?$)|(?:\\A)|(?:\\Z))) if you paste that twice in ctrl+f find, you will see it in destructor_inline, function_definition and operator_overload https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L2823-L2824 https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L4143-L4144 https://github.com/jeff-hykin/better-cpp-syntax/blob/562ec9a12ecc21179d54e9bc7bfae5f4cd4fff00/autogenerated/cpp.tmLanguage.json#L7090-L7091

image notice the double up of the 4 sets of captures image

RedCMD avatar Feb 03 '23 20:02 RedCMD

ahhhhh..... @jeff-hykin

I don't see that pattern snippet in the file you linked.

I see the problem now I've been confusing https://github.com/jeff-hykin/better-cpp-syntax/tree/master/syntaxes with https://github.com/jeff-hykin/better-cpp-syntax/tree/master/autogenerated syntaxes one is very much outdated

tho the issue is still present in the newer one

RedCMD avatar Feb 03 '23 21:02 RedCMD

syntaxes one is very much outdated

Yeah, I should ask alexr if I can go ahead and delete syntaxes/. I'm pretty sure I can but I didn't want to potentially cause any breakages.

tho the issue is still present in the newer one

Alright @RedCMD , I think I found the source of the repeated pattern problem. Some composable patterns had std_space at the end, and others had it at the begining, so when they were used next to eachother they doubled up. I got rid of that, so take a look at the autogenerated/cpp.tmLanguage.json in this commit. Hopefully that change squashed all of the repeats.

jeff-hykin avatar Feb 03 '23 21:02 jeff-hykin

still one left in operator_overload https://github.com/jeff-hykin/better-cpp-syntax/blob/190c42bd2b0ae514be32525e4b67122b52c1bb30/autogenerated/cpp.tmLanguage.json#L3595-L3596 image image

the generator really likes putting (?: ... ) groups around everything

RedCMD avatar Feb 03 '23 22:02 RedCMD

the generator really likes putting (?: ... ) groups around everything

Yeah, its cause of modularity. Code like maybe( *any possible value* ) has to be generated first as (?: *any possible value* )? until its proven that unwrapping it to be *any possible value*? would behave equivlently. Right now its not doing full regex parsing, so theres a lot of stuff that could be unwrapped that isn't. For example, I don't think character classes are unwrapped so (?:[abc])? just stays as-is since it can't prove that unwrapping is safe.

We have the concept of "single_entity", which a and (?:abc) are a single entity, but abc is not And that determines if it needs to be wrapped or not. For example

If this funciton were more intelligent, it would get rid of any unnecessary wrappings: https://github.com/jeff-hykin/ruby_grammar_builder/blob/eb206ce1b1fbde4ce8013c67e262fdc2e7c8d2a1/main/lib/ruby_grammar_builder/util.rb#L53

I tried looking around for a generic regex minimizer/optimizer/uglifier a few years ago but didn't see anything of the sort.

jeff-hykin avatar Feb 06 '23 14:02 jeff-hykin

Actually, looking at the source there were some obvious mistakes. I updated the library, and now there's 17,588 fewer chars in the generated grammar, and all the test output code is identical.

jeff-hykin avatar Feb 06 '23 15:02 jeff-hykin

I know for a fact we optimized (?:a+)? to become a* but it looks like that also is not always happening

jeff-hykin avatar Feb 06 '23 15:02 jeff-hykin

I can confirm the lag has been greatly reduced it no longer freezes VSCode, now only taking 1-2sec to tokenize

and with it being near instant for int var;

RedCMD avatar Mar 06 '24 22:03 RedCMD