cppfront Feature/regular expression metafunction

I will update this overview such that it is easy to grasp the status of the implementation.

Example file: example.cpp2

example: @regex type = {
  regex := "ab*bd";
}
main: (args) = {
    r := example().regex.search("abbbbbdfoo");
    std::cout << "got: (r.group(0))$" << std::endl;
}
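
For readers without a cppfront build at hand, the same pattern can be tried with std::regex. This is only a comparison sketch and not the @regex implementation; first_match is a hypothetical helper name, and the default ECMAScript grammar is assumed.

```cpp
#include <regex>
#include <string>

// Returns the text of the first match of `pattern` in `text`,
// or an empty string if there is no match.
// (Hypothetical helper for comparison only, not part of cppfront.)
std::string first_match(const std::string& pattern, const std::string& text) {
    const std::regex re(pattern);  // default grammar: ECMAScript
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string{};
}
```

Called as first_match("ab*bd", "abbbbbdfoo"), this returns abbbbbd, which is the same group 0 the example above prints.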

Current status and planned work

Modifiers

 - [x] i                Do case-insensitive pattern matching. For example, "A" will match "a" under /i.
 - [x] m                Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
 - [x] s                Treat the string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
 - [x] x and xx         Extend your pattern's legibility by permitting whitespace and comments. Details in "/x and /xx"
 - [x] n                Prevent the grouping metacharacters () from capturing. This modifier, new in 5.22, will stop $1, $2, etc... from being filled in.
 - [ ] c                keep the current position during repeated matching

Escape sequences (Complete)

 - [x] \t          tab                   (HT, TAB)
 - [x] \n          newline               (LF, NL)
 - [x] \r          return                (CR)
 - [x] \f          form feed             (FF)
 - [x] \a          alarm (bell)          (BEL)
 - [x] \e          escape (think troff)  (ESC)
 - [x] \x{}, \x00  character whose ordinal is the given hexadecimal number
 - [x] \o{}, \000  character whose ordinal is the given octal number

Quantifiers (Complete)

 - [x] *           Match 0 or more times
 - [x] +           Match 1 or more times
 - [x] ?           Match 1 or 0 times
 - [x] {n}         Match exactly n times
 - [x] {n,}        Match at least n times
 - [x] {,n}        Match at most n times
 - [x] {n,m}       Match at least n but not more than m times
 - [x] *?        Match 0 or more times, not greedily
 - [x] +?        Match 1 or more times, not greedily
 - [x] ??        Match 0 or 1 time, not greedily
 - [x] {n}?      Match exactly n times, not greedily (redundant)
 - [x] {n,}?     Match at least n times, not greedily
 - [x] {,n}?     Match at most n times, not greedily
 - [x] {n,m}?    Match at least n but not more than m times, not greedily
 - [x] *+     Match 0 or more times and give nothing back
 - [x] ++     Match 1 or more times and give nothing back
 - [x] ?+     Match 0 or 1 time and give nothing back
 - [x] {n}+   Match exactly n times and give nothing back (redundant)
 - [x] {n,}+  Match at least n times and give nothing back
 - [x] {,n}+  Match at most n times and give nothing back
 - [x] {n,m}+ Match at least n but not more than m times and give nothing back
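
The difference between a greedy quantifier and one that "gives nothing back" can be sketched by hand for the two patterns a*a and a*+a. This is an illustration of the semantics only, with hypothetical function names; it is not how the metafunction generates its matchers.

```cpp
#include <cstddef>
#include <string_view>

// Greedy `a*a`, anchored at the start of `s`: a* first grabs the whole
// run of 'a', then gives characters back one at a time until the
// trailing `a` can match.
bool greedy_star_then_a(std::string_view s) {
    std::size_t n = 0;
    while (n < s.size() && s[n] == 'a') ++n;  // a* takes as much as possible
    for (;; --n) {                            // backtrack one character per step
        if (n < s.size() && s[n] == 'a') return true;
        if (n == 0) return false;
    }
}

// Possessive `a*+a`: a*+ keeps everything it grabbed, so the trailing
// `a` never finds an 'a' left to match.
bool possessive_star_then_a(std::string_view s) {
    std::size_t n = 0;
    while (n < s.size() && s[n] == 'a') ++n;  // a*+ takes as much as possible
    return n < s.size() && s[n] == 'a';       // no backtracking allowed
}
```

For the input aaa the greedy version succeeds after giving back one character, while the possessive version fails.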

Character Classes and other Special Escapes (Complete)

 - [x] [...]     [1]  Match a character according to the rules of the
                    bracketed character class defined by the "...".
                    Example: [a-z] matches "a" or "b" or "c" ... or "z"
 - [x] [[:...:]] [2]  Match a character according to the rules of the POSIX
                    character class "..." within the outer bracketed
                    character class.  Example: [[:upper:]] matches any
                    uppercase character.
 - [x] \g1       [5]  Backreference to a specific or previous group,
 - [x] \g{-1}    [5]  The number may be negative indicating a relative
                  previous group and may optionally be wrapped in
                  curly brackets for safer parsing.
 - [x] \g{name}  [5]  Named backreference
 - [x] \k<name>  [5]  Named backreference
 - [x] \k'name'  [5]  Named backreference
 - [x] \k{name}  [5]  Named backreference
 - [x] \w        [3]  Match a "word" character (alphanumeric plus "_", plus
                    other connector punctuation chars plus Unicode
                    marks)
 - [x] \W        [3]  Match a non-"word" character
 - [x] \s        [3]  Match a whitespace character
 - [x] \S        [3]  Match a non-whitespace character
 - [x] \d        [3]  Match a decimal digit character
 - [x] \D        [3]  Match a non-digit character
 - [x] \v        [3]  Vertical whitespace
 - [x] \V        [3]  Not vertical whitespace
 - [x] \h        [3]  Horizontal whitespace
 - [x] \H        [3]  Not horizontal whitespace
 - [x] \1        [5]  Backreference to a specific capture group or buffer.
                    '1' may actually be any positive integer.
 - [x] \N        [7]  Any character but \n.  Not affected by /s modifier
 - [x] \K        [6]  Keep the stuff left of the \K, don't include it in $&

Assertions

 - [x] \b     Match a \w\W or \W\w boundary
 - [x] \B     Match except at a \w\W or \W\w boundary
 - [x] \A     Match only at beginning of string
 - [x] \Z     Match only at end of string, or before newline at the end
 - [x] \z     Match only at end of string
 - [ ] \G     Match only at pos() (e.g. at the end-of-match position
          of prior m//g)

Capture groups (Complete)

 - [x] (...)

Quoting metacharacters (Complete)

 - [x] For ^.[]$()*{}?+|\

Extended Patterns

 - [x] (?<NAME>pattern)            Named capture group
 - [x] (?#text)                    Comments
 - [x] (?adlupimnsx-imnsx)         Modification for surrounding context
 - [x] (?^alupimnsx)               Modification for surrounding context
 - [x] (?:pattern)                 Clustering, does not generate a group index.
 - [x] (?adluimnsx-imnsx:pattern)  Clustering, does not generate a group index and modifications for the cluster.
 - [x] (?^aluimnsx:pattern)        Clustering, does not generate a group index and modifications for the cluster.
 - [x] (?|pattern)                 Branch reset
 - [x] (?'NAME'pattern)            Named capture group
 - [ ] (?(condition)yes-pattern|no-pattern)  Conditional patterns.
 - [ ] (?(condition)yes-pattern)             Conditional patterns.
 - [ ] (?>pattern)                 Atomic patterns. (Disable backtrack.)
 - [ ] (*atomic:pattern)           Atomic patterns. (Disable backtrack.)

Lookaround Assertions

 - [x] (?=pattern)                     Positive look ahead.
 - [x] (*pla:pattern)                  Positive look ahead.
 - [x] (*positive_lookahead:pattern)   Positive look ahead.
 - [x] (?!pattern)                     Negative look ahead.
 - [x] (*nla:pattern)                  Negative look ahead.
 - [x] (*negative_lookahead:pattern)   Negative look ahead.
 - [ ] (?<=pattern)                    Positive look behind.
 - [ ] (*plb:pattern)                  Positive look behind.
 - [ ] (*positive_lookbehind:pattern)  Positive look behind.
 - [ ] (?<!pattern)                    Negative look behind.
 - [ ] (*nlb:pattern)                  Negative look behind.
 - [ ] (*negative_lookbehind:pattern)  Negative look behind.
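
As a point of reference for the implemented lookahead forms, the (?=pattern) and (?!pattern) spellings behave as follows in std::regex with its ECMAScript grammar. This is a comparison sketch with a hypothetical helper name; it does not exercise the @regex metafunction.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` (ECMAScript grammar) matches somewhere in `text`.
// (Hypothetical helper for comparison only.)
bool found(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}
```

Here found("foo(?=bar)", "foobar") is true and found("foo(?!bar)", "foobar") is false, because the lookahead only inspects the text after foo without consuming it.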

Special Backtracking Control Verbs

 - [ ] (*PRUNE) (*PRUNE:NAME)   No backtracking over this point.
 - [ ] (*SKIP) (*SKIP:NAME)     Start next search here.
 - [ ] (*MARK:NAME) (*:NAME)    Place a named mark.
 - [ ] (*THEN) (*THEN:NAME)     Like PRUNE.
 - [ ] (*COMMIT) (*COMMIT:arg)  Stop searching.
 - [ ] (*FAIL) (*F) (*FAIL:arg) Fail the pattern/branch.
 - [ ] (*ACCEPT) (*ACCEPT:arg)  Accept the pattern/subpattern.

Not planned (Mainly because of Unicode or perl specifics)

Modifiers

 - [ ] p                Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} are available for use after matching.
 - [ ] a, d, l, and u   These modifiers, all new in 5.14, affect which character-set rules (Unicode, etc.) are used, as described below in "Character set modifiers".
 - [ ] g                globally match the pattern repeatedly in the string
 - [ ] e                evaluate the right-hand side as an expression
 - [ ] ee               evaluate the right side as a string then eval the result
 - [ ] o                pretend to optimize your code, but actually introduce bugs
 - [ ] r                perform non-destructive substitution and return the new value

Escape sequences

 - [ ] \cK         control char          (example: VT)
 - [ ] \N{name}    named Unicode character or character sequence
 - [ ] \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
 - [ ] \l          lowercase next char (think vi)
 - [ ] \u          uppercase next char (think vi)
 - [ ] \L          lowercase until \E (think vi)
 - [ ] \U          uppercase until \E (think vi)
 - [ ] \Q          quote (disable) pattern metacharacters until \E
 - [ ] \E          end either case modification or quoted section, think vi

Character Classes and other Special Escapes

 - [ ]  (?[...])  [8]  Extended bracketed character class
 - [ ] \pP       [3]  Match P, named property.  Use \p{Prop} for longer names
 - [ ] \PP       [3]  Match non-P
 - [ ] \X        [4]  Match Unicode "eXtended grapheme cluster"
 - [ ] \R        [4]  Linebreak

Assertions

 - [ ] \b{}   Match at Unicode boundary of specified type
 - [ ] \B{}   Match where corresponding \b{} doesn't match

Extended Patterns

 - [ ] (?{ code })                 Perl code execution.
 - [ ] (*{ code })                 Perl code execution.
 - [ ] (??{ code })                Perl code execution.
 - [ ] (?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)       Recursive subpattern.
 - [ ] (?&NAME)                   Recursive subpattern.

Script runs

 - [ ] (*script_run:pattern)         All chars in pattern need to be of the same script.
 - [ ] (*sr:pattern)                 All chars in pattern need to be of the same script.
 - [ ] (*atomic_script_run:pattern)  Without backtracking.
 - [ ] (*asr:pattern)                Without backtracking.

Dec 21 '23 19:12 MaxSagebaum

I am aiming to implement the POSIX extended specification with a few extras from perl. All in all I am sticking to the perl interpretation of regular expressions.

Dec 21 '23 19:12 MaxSagebaum

The feature set is now complete as stated in https://en.wikipedia.org/wiki/Regular_expression. I grabbed the test suite from https://wiki.haskell.org/Regex_Posix https://hackage.haskell.org/package/regex-posix-unittest. I am currently working my way through the tests and fixing all the corner cases. In particular, the greedy nature of * and the backtracking needed when it grabs too much might require a rework of the matching logic.

Dec 28 '23 12:12 MaxSagebaum

I have now finished the basic implementation, and most of the test suite passes.

Some notable differences with respect to POSIX ERE:

Alternatives are not greedy: matched against ab, a|ab matches a and not ab.
The match is case sensitive.
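
The first-alternative-wins behavior is the same one std::regex shows with its default ECMAScript grammar, so it can be observed there. This is a comparison sketch with a hypothetical helper name, not the cppfront implementation.

```cpp
#include <regex>
#include <string>

// Text matched by `pattern` in `text` under the given std::regex grammar,
// or an empty string if there is no match.
// (Hypothetical helper for comparison only.)
std::string alternative_match(const std::string& pattern, const std::string& text,
                              std::regex::flag_type grammar) {
    const std::regex re(pattern, grammar);
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string{};
}
```

With std::regex::ECMAScript, alternative_match("a|ab", "ab", ...) returns a. Passing std::regex::extended selects the POSIX ERE grammar, where the leftmost-longest rule would prefer ab; standard library implementations may differ on that point, so it is worth checking on the target toolchain.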

I am now looking into performance tests and I will clean up the code.

Jan 03 '24 14:01 MaxSagebaum

I did some basic performance testing. For this I used the benchmark-exec from https://github.com/hanickadot/compile-time-regular-expressions and extended it to include the regular expression metafunction from cppfront. The first results were:

100 MB file:
ctre: 409 ms
cppfront base (all runtime checks) :           80711 ms
cppfront no runtime checks:                    15783 ms (options: -no-c -no-n -no-s)
cppfront no runtime checks & no copy of state:  3021 ms

As you can see, the runtime checks and the copy of the regex state really hurt. But even without them, the result was quite underwhelming: my implementation is still more than 7 times slower than ctre (3021 ms vs. 409 ms).

Since both implementations use templates to build up the regular expression, I was puzzled and took a closer look. ctre always provides the remainder of the regular expression to each matcher. That is, each matcher can verify whether its match leads to a valid match of the whole regular expression. I did not do this, so I had to keep the state of the matchers and some way to restore it. This was very costly. Fortunately, I could adapt my implementation quite easily.
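
The idea of handing the remainder of the expression to each matcher can be sketched with a few templates. This is a minimal illustration under assumed names (end_of_pattern, lit, star, ab_star_bd, matches_prefix); it is neither ctre's nor cppfront's actual code, and it only supports single characters.

```cpp
#include <string_view>

// Terminal tail: the whole pattern has been consumed, so the match succeeds.
struct end_of_pattern {
    static bool match(std::string_view) { return true; }
};

// Matches one literal character, then asks the tail to match the rest.
template <char C, typename Tail>
struct lit {
    static bool match(std::string_view s) {
        return !s.empty() && s.front() == C && Tail::match(s.substr(1));
    }
};

// Greedy C*: tries the longest run first. Because the tail is known,
// backtracking is just "ask the tail again with one character less";
// no matcher state has to be saved or restored.
template <char C, typename Tail>
struct star {
    static bool match(std::string_view s) {
        if (!s.empty() && s.front() == C && match(s.substr(1))) return true;
        return Tail::match(s);
    }
};

// The example pattern "ab*bd" encoded as nested types.
using ab_star_bd = lit<'a', star<'b', lit<'b', lit<'d', end_of_pattern>>>>;

// True if the pattern matches at the start of `s`.
bool matches_prefix(std::string_view s) { return ab_star_bd::match(s); }
```

For abbbbbdfoo the star first consumes all five b characters, the tail bd fails, and the recursion unwinds by one b so that the tail can succeed.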

The rewritten version now looks better:

100 MB file:
ctre:                           409 ms
cppfront no runtime checks:     250 ms (options: -no-c -no-n -no-s)
cppfront with runtime checks: 10918 ms

The runtime checks still hit quite hard. I would not advertise that my implementation is faster than ctre. It is still quite rudimentary and, for example, cannot be configured to ignore case. Therefore, I would only say that they are in the same performance range.

The performance test on a 4 GB file looks similar:

"boost::regex";"ABCD|DEFGH|EFGHI|A{4,}"; 115040 ms
"cppfront";"ABCD|DEFGH|EFGHI|A{4,}";       5154 ms
"CTRE";"ABCD|DEFGH|EFGHI|A{4,}";           8959 ms
"PCRE2";"ABCD|DEFGH|EFGHI|A{4,}";         18784 ms
"PCRE2 (jit)";"ABCD|DEFGH|EFGHI|A{4,}";    5003 ms
"RE2";"ABCD|DEFGH|EFGHI|A{4,}";            9113 ms
"srell";"ABCD|DEFGH|EFGHI|A{4,}";         17065 ms
"std::regex";"ABCD|DEFGH|EFGHI|A{4,}";   110107 ms

Compilation times:

(The pure2-regex.cpp2 file.)
428 regular expressions.
cppfront:
cpp2 -> cpp: 1.37 s
cpp -> exe: 23.82 s

ctre:
cpp -> exe: 733.12 s

I think we have a real benefit here. With the metafunction, we can parse the regular expression with regular code that builds up the templates. The C++ compiler only needs to compile the templates and does not have to parse the regular expression.

Now with the performance tests done, I can finally clean up the code. ;)

Jan 06 '24 14:01 MaxSagebaum

I made a mistake in the cppfront recompilation with the runtime checks. Updated the value from 350 ms to 10918 ms.

Jan 06 '24 15:01 MaxSagebaum

Thank you for looking into this.

The initial implementation declares more templates than I wished for (0). With https://github.com/hsutter/cppfront/discussions/797#discussioncomment-7759206, I literally meant lowering the match for /(a|b)/ to return s == 'a' || s == 'b';. For #514, that would mean lowering the match for ^alignas|^alignof|^asm|^as|^auto|… to the merged call to std::find_if.

I understand that it is suboptimal to generalize this, as there are better algorithms for string matching. But I think it would be best to start with build-time performance in mind.
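
One possible reading of this lowering, written out by hand, is shown here. It is an illustrative guess at what generated code could look like, not actual cppfront output; the function names are hypothetical, and the keyword list stops where the quoted pattern is elided.

```cpp
#include <algorithm>
#include <array>
#include <string_view>

// /(a|b)/ on a single character, lowered to plain comparisons.
bool match_a_or_b(char s) { return s == 'a' || s == 'b'; }

// ^alignas|^alignof|^asm|^as|^auto|... lowered to one std::find_if over
// the alternatives. Only the alternatives visible in the quoted pattern
// are listed here.
bool starts_with_keyword(std::string_view s) {
    static constexpr std::array<std::string_view, 5> keywords{
        "alignas", "alignof", "asm", "as", "auto"};
    return std::find_if(keywords.begin(), keywords.end(),
                        [&](std::string_view k) {
                            // substr clamps the length, so short inputs are safe.
                            return s.substr(0, k.size()) == k;
                        }) != keywords.end();
}
```

No matcher templates are instantiated in this form, which is where the build-time advantage would come from.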

The performance test on a 4 GB file looks similar:

Can you explain this in more detail? What is ABCD|DEFGH|EFGHI|A{4,}? What's the regex that is taking CTRE 733 s to compile and @regex 24 s?

Jan 06 '24 15:01 JohelEGP

The initial implementation declares more templates than I wished for (0). With #797 (comment), I literally meant lowering the match for /(a|b)/ to return s == 'a' || s == 'b';. For #514, that would mean lowering the match for ^alignas|^alignof|^asm|^as|^auto|… to the merged call to std::find_if.

Ok, now I understand you, and I also see the implication for compile times. The problem is that this will be very ugly for larger expressions. Ranges will also be quite hard to handle this way.

The performance test on a 4 GB file looks similar:

Can you explain this in more detail? What is ABCD|DEFGH|EFGHI|A{4,}? What's the regex that is taking CTRE 733 s to compile and @regex 24 s?

ABCD|DEFGH|EFGHI|A{4,} was the regex that was matched on the file. The file was generated with

dd if=/dev/urandom bs=2147483648 count=1 | base64 > test.txt

The ms results are the runtime to search for the first match in each file. All in all, there are only about 137 matches. Therefore, no early out is possible, and the regular expression needs to scan the whole line.

The compilation times are for the pure2-regex.cpp2 file which includes 428 regular expressions. I adapted this file so that ctre can also run with it. The rough approximation for the compile time per regular expression would be:

cppfront cpp2 -> cpp: 0.003 seconds per regular expression
cppfront cpp -> exe: 0.05 seconds per regular expression
ctre cpp -> exe: 1.71 seconds per regular expression
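
The per-expression numbers are simply the total compile times divided by the 428 regular expressions in the file. A tiny sketch of that arithmetic, with a hypothetical function name:

```cpp
// Average compile time per regular expression, in seconds.
double seconds_per_regex(double total_seconds, int regex_count) {
    return total_seconds / regex_count;
}
```

With the totals reported above, seconds_per_regex(1.37, 428) is about 0.0032, seconds_per_regex(23.82, 428) is about 0.056, and seconds_per_regex(733.12, 428) is about 1.713.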

The number for ctre seems quite high, though. Its author reported a compilation time of 0.1 seconds per regular expression. Maybe there is one regular expression that is quite intensive on the compiler.

Jan 06 '24 16:01 MaxSagebaum

It seems like your solution is doing great in terms of performance.

Jan 06 '24 18:01 JohelEGP

I have now cleaned up the code.

I consider this now finished from my side. I would appreciate a review and suggestions on the final implementation/solutions. I want to squash the history into one commit.

Some questions that are still open from my side:

[ ] Is the string_util.h ok where it is or should it be moved?
[x] Is the include #include "../source/regex.h" in cpp2util.h ok?

Some things I still need to do:

[ ] Check the todos in regex.h2 and raise issues or reference existing issues.
[x] Write the documentation for regex in reflect.h2. (Missed that on my last commit.)
[x] Fix github regression tests.
[ ] Squash all commits. (I think the history is not that useful.)

Jan 12 '24 15:01 MaxSagebaum

Is the include #include "../source/regex.h" in cpp2util.h ok?

That's not OK by convention.

Jan 12 '24 15:01 JohelEGP

Is the include #include "../source/regex.h" in cpp2util.h ok?

That's not OK by convention.

What would be the correct way to solve this? Move source/regex.h to include and create a link to source? Create a link to include? Should the same be done for reflect.h?

Jan 17 '24 20:01 MaxSagebaum

Is the include #include "../source/regex.h" in cpp2util.h ok?

That's not OK by convention.

What would be the correct way to solve this? Move source/regex.h to include and create a link to source? Create a link to include? Should the same be done for reflect.h?

Like I do for cpp2reflect.h in #907. Move source/regex.h to include/cpp2regex.h. If something in source/ wants to #include "cpp2regex.h", you can do like source/cpp2util.h.

Jan 17 '24 21:01 JohelEGP

I added the cpp2regex.h to include.

Jan 22 '24 09:01 MaxSagebaum

Done.

Jan 25 '24 07:01 MaxSagebaum

Thanks again for this. Questions:

There are 126 "Failure:" in the output. Is that intentional?
Is being non-greedy a correctness issue?

Jan 25 '24 21:01 hsutter

I count only 63 "Failure" entries. I think you counted two files. ;-)

Just a few notes first:

Implementing the non greedy version is much simpler.
It is basically a convention. Perl regular expressions have the alternative as non greedy.

So the remaining failure cases are just due to the non-greedy nature of the alternative match. I thought about it in the meantime and believed that I had a solution. I tried to implement it. Oh, how wrong I was. But 3 hours later I have a working version of the greedy alternative match. This is actually a hit on performance, since it is much more involved:

100 MB text file

"cppfront: non-greedy-alternative";"ABCD|DEFGH|EFGHI|A{4,}";   251 ms
"cppfront: greedy-alternative";"ABCD|DEFGH|EFGHI|A{4,}";      1238 ms
"CTRE";"ABCD|DEFGH|EFGHI|A{4,}";                               411 ms

The greedy nature reduces the "Failures" to 40. The remaining ones seem to come from the interaction of the ranges and alternatives. The best example is class::9. I need to think about this a little bit more, but I do not think that I can change the implementation to make this work. If you want, I can elaborate on what goes wrong and why it is not possible.

Summary: The failures come from the non-greedy nature and the interaction with the ranges match. These are mostly really hard corner cases. So the regex implementation should be good enough.

Since the greedy nature really hits performance, I made it a compile-time constant and kept the old behavior.
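
How such a compile-time switch can look is sketched here for the alternation a|ab. The template parameter name GreedyAlternative and both function names are assumptions for illustration; the actual constant in the implementation may be named and wired differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Length of `lit` if `s` starts with it, otherwise 0.
constexpr std::size_t match_literal(std::string_view s, std::string_view lit) {
    return s.substr(0, lit.size()) == lit ? lit.size() : 0;
}

// Matches a|ab at the start of `s` and returns the matched length
// (0 means no match). The strategy is selected at compile time, so the
// non-greedy variant does not pay for the extra work of the greedy one.
template <bool GreedyAlternative>
constexpr std::size_t match_a_or_ab(std::string_view s) {
    if constexpr (GreedyAlternative) {
        // POSIX-like: evaluate every alternative and keep the longest.
        return std::max(match_literal(s, "a"), match_literal(s, "ab"));
    } else {
        // Perl-like: the first alternative that matches wins.
        const std::size_t first = match_literal(s, "a");
        return first != 0 ? first : match_literal(s, "ab");
    }
}
```

For the input ab, match_a_or_ab<false> returns 1 and match_a_or_ab<true> returns 2.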

Jan 26 '24 18:01 MaxSagebaum

Thanks! Sorry for the double-counting, I just hit Ctrl-F and let Chrome tell me the total #hits. 😄

Making greediness an option sounds perfect -- then it's don't-pay-for-what-you-don't-use, and it can be compared to both greedy and non-greedy alternative implementations.

When you say interaction of ranges and alternatives, do you mean examples like a-f|e-j, or something else?

Understood about being good enough, if the failure cases are rare edge cases -- other regex implementations have bugs too (including all the three major standard library implementations, I think). As long as it's in the neighborhood of being as complete as others it can be compared, and I'm very interested in doing comparisons.

Here's a litmus test: I was thinking of using this as an example in a talk in April. Would I be credible or criticized if I used this as an example on stage, and compared it to other regex implementations to demonstrate the benefits of a source code generation approach? Is it good enough to be an apples-to-apples comparison, or would there be legitimate "but it takes shortcuts / is only a partial implementation" objections?

Also, if we merge this now, what opportunities are there in the next ~2 months to further improve one or more of { run time, compile time, completeness/accuracy-if-needed } ?

Thanks again for this, it's very interesting and already showing strong progress. Much appreciated!

Jan 26 '24 18:01 hsutter

Thanks! Sorry for the double-counting, I just hit Ctrl-F and let Chrome tell me the total #hits. 😄

No problem. ;-)

Understood about being good enough, if the failure cases are rare edge cases -- other regex implementations have bugs too (including all the three major standard library implementations, I think). As long as it's in the neighborhood of being as complete as others it can be compared, and I'm very interested in doing comparisons.

I would also like to do comparisons on a larger basis and I think the implementation is already ready for this.

Here's a litmus test: I was thinking of using this as an example in a talk in April. Would I be credible or criticized if I used this as an example on stage, and compared it to other regex implementations to demonstrate the benefits of a source code generation approach? Is it good enough to be an apples-to-apples comparison, or would there be legitimate "but it takes shortcuts / is only a partial implementation" objections?

The shortcuts I currently take are just minor ones and could be fixed in a reasonably short time. Using this in a demo should therefore not be a problem. A general audience should not raise objections; only regex experts might object to the details, but they should be aware that this is just an example. In order to minimize objections, I would propose two parts for the demo:

Show a simple enough regular expression and compare the performance.
Show a very involved regular expression nobody understands, show that it works, and compare the performance. (I would have to look for such an expression myself.)

One point that is very important would be to cite the work of Hana Dusíková on compile-time regular expressions (https://github.com/hanickadot/compile-time-regular-expressions), since this implementation is based on and inspired by her work. She also mentioned in one of her last talks that she wants to explore a better way for the compile-time generation.

Also, if we merge this now, what opportunities are there in the next ~2 months to further improve one or more of { run time, compile time, completeness/accuracy-if-needed } ?

Short answer: Yes.

run time: maybe but probably not,
compile time: maybe (I would love to get some hints on how to measure the performance of the compiler)
completeness: yes

Long answer: With regard to the explosion of possible matches I describe below, I would go away from the POSIX ERE compatible implementation to a perl conforming implementation. (Here the alternative | returns the first possible match.) Most of the basic missing perl features could be implemented quite easily. The more advanced ones like look ahead and look behind could take a bit, but are not strictly required.

I would make the change as follows:

Write a converter for the perl regular expression test suite: https://perl5.git.perl.org/perl5.git/tree/HEAD:/t/re and replace the current test suite.
Add the basic perl syntax e.g. \n, *?, *+.
Work on more advanced perl features. (This may take some time and would be a project for the year.)

I would say that about 95% of the perl regex syntax could be done in the next two months. Everything else would require more time. (Just a very rough estimate.)

When you say interaction of ranges and alternatives, do you mean examples like a-f|e-j, or something else?

I was referring to this expression: (aba|a*b)*. The test says it should match ababa with the groups 0 = 0-5 (ababa) and 1 = 2-5 (aba). We have the alternative aba|a*b and two ranges *. POSIX requires both the alternative and the whole match to be greedy. That is, the alternative must choose the branch that yields the longest overall match. In the example above, we first take aba. A second aba does not match, so we take a*b, which only matches b. A third iteration cannot match, so the result is abab, which has a length of 4. We then also need to consider a*b for the first iteration, which matches ab. The second iteration can now match aba, which consumes the whole string and gives a length of 5. This means the first iteration must not take the longest submatch aba, since the shorter submatch ab produces an overall longer match.

Even with (aba|ab)* the number of possibilities is 2^n, where n is the number of times * can find a match. And all possibilities need to be tried.
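
The two strategies for (aba|a*b)* can be compared with a small brute-force sketch. The pattern is hard-coded and all function names are hypothetical; this only illustrates the search space and is not the metafunction's matching logic.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Length matched by the literal aba at the start of `s`, or 0.
std::size_t match_aba(std::string_view s) {
    return s.substr(0, 3) == "aba" ? 3 : 0;
}

// Length matched by a*b at the start of `s`, or 0.
std::size_t match_a_star_b(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size() && s[i] == 'a') ++i;
    return (i < s.size() && s[i] == 'b') ? i + 1 : 0;
}

// First alternative wins: every iteration of the outer * commits to the
// first alternative that matches. Returns the total matched length.
std::size_t match_first(std::string_view s) {
    std::size_t pos = 0;
    for (;;) {
        std::size_t n = match_aba(s.substr(pos));
        if (n == 0) n = match_a_star_b(s.substr(pos));
        if (n == 0) return pos;
        pos += n;
    }
}

// Longest overall match: every iteration explores both alternatives and
// keeps the best total length. This is the exponential search described above.
std::size_t match_longest(std::string_view s) {
    std::size_t best = 0;
    const std::size_t lengths[] = {match_aba(s), match_a_star_b(s)};
    for (std::size_t n : lengths) {
        if (n != 0) best = std::max(best, n + match_longest(s.substr(n)));
    }
    return best;
}
```

For ababa, match_first stops after abab (length 4), while match_longest finds ab followed by aba (length 5), which is the result the POSIX test expects.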

Our implementation has the problem that we do not provide the expression tail to the inner expression of the *. Therefore, it cannot decide what the longest overall match is. Simplified, it looks like this:

r := M::match(cur, ctx, no_tail<CharT>()); // Match the inner expression without any tail, e.g. 'aba|a*b'.
o := ...; // Recursively check whether 'M' matches again, followed by the tail.
if o.matched {
  return o;
}
else if r.matched { // No further match of 'M', so try to match the tail.
  return Tail::match(cur, ctx);
}
return false;

I thought about how this could be changed, but did not come to a fully satisfying answer. Either we would have an infinite recursion in the compiler, or we would need extensive state management for the ranges. This could be implemented by moving everything from static functions to non-static ones, but I am not sure whether it would really solve the problem or how much performance it would cost. I might try this if I have a lot of time left.

Jan 29 '24 10:01 MaxSagebaum

I compiled a list of the perl regex features and started implementing them last weekend. Most of the main features are finished and I am using the perl regex test suite to validate the implementation. My next goal is to implement the modifiers and afterwards the extended patterns.

From now on I will gather the information in the first post so that a quick summary of the status can be grasped. But I will also ping updates in comments.

Feb 21 '24 08:02 MaxSagebaum

I did an update on Friday. The initial post is updated. The new features are:

Modifiers

 - [x] m                Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
 - [x] s                Treat the string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.

Extended Patterns

- [x] (?adlupimnsx-imnsx)         Modification for surrounding context
- [x] (?^alupimnsx)               Modification for surrounding context
- [x] (?:pattern)                 Clustering, does not generate a group index.
- [x] (?adluimnsx-imnsx:pattern)  Clustering, does not generate a group index and modifications for the cluster.
- [x] (?^aluimnsx:pattern)        Clustering, does not generate a group index and modifications for the cluster.

Mar 04 '24 07:03 MaxSagebaum

A further update. The initial post is updated. The new features are:

Modifiers

 - [x] x and xx         Extend your pattern's legibility by permitting whitespace and comments. Details in "/x and /xx"
 - [x] n                Prevent the grouping metacharacters () from capturing. This modifier, new in 5.22, will stop $1, $2, etc... from being filled in.

Extended Patterns

 - [x] (?#text)                    Comments
 - [x] (?|pattern)                 Branch reset
 - [x] (?'NAME'pattern)            Named capture group

Mar 07 '24 16:03 MaxSagebaum

A final update for now. I have added the lookahead functions.

Escape sequences

 - [x] \x{}, \x00  character whose ordinal is the given hexadecimal number
 - [x] \o{}, \000  character whose ordinal is the given octal number

Lookaround Assertions

 - [x] (?=pattern)                     Positive look ahead.
 - [x] (*pla:pattern)                  Positive look ahead.
 - [x] (*positive_lookahead:pattern)   Positive look ahead.
 - [x] (?!pattern)                     Negative look ahead.
 - [x] (*nla:pattern)                  Negative look ahead.
 - [x] (*negative_lookahead:pattern)   Negative look ahead.

I moved these to "Not planned", since they are modifiers on the replacement string.

Escape sequences

 - [ ] \l          lowercase next char (think vi)
 - [ ] \u          uppercase next char (think vi)
 - [ ] \L          lowercase until \E (think vi)
 - [ ] \U          uppercase until \E (think vi)
 - [ ] \Q          quote (disable) pattern metacharacters until \E
 - [ ] \E          end either case modification or quoted section, think vi

Mar 10 '24 20:03 MaxSagebaum

I now consider the implementation finished. The major features are implemented, and the remaining ones should not change the basic design that much. This is also because I do not have much time in the next 2 months.

Mar 10 '24 20:03 MaxSagebaum

Thanks! I'll start taking a pass over the failing tests...

Mar 10 '24 21:03 hsutter

Thanks. I planned to do that after the review, once I am back from vacation this week.

Mar 10 '24 21:03 MaxSagebaum

[corrected: somehow I initially got a non-current PR]

Thanks! Initial comment: This might be the first major PR that compiles totally clean on the first try for me including at high Cpp1 warning levels. (Almost: MSVC had two unused name warnings, and Clang 12 couldn't handle std::format). Nice -- that's hard to do unless regularly building with all major compilers locally, which understandably not everyone has available.

Mar 19 '24 16:03 hsutter

I've taken a first pass through and have some commits to push -- please hold any new commits until I can fix my branch(*) and push those, thanks!

(*) boring reasons: because it seems GitHub Desktop got confused with conflicts (which seems impossible since there haven't been any other pushes today) and the solution that seems to work is to reset the branch and reapply the changes but that will take me a little more time

Mar 19 '24 19:03 hsutter

OK, I've pushed all the commits I have for now from the first review.

✅ Cppfront builds clean
⚠️ The test case file pure2-regex.cpp2 generates cppfront-compile-time metafunction errors for some of the cases.

I've created a (temporary) pure2-regex-partial.cpp2 which is the same as pure2-regex.cpp2 with all the cases commented out that generated metafunction errors. This compiles clean in cppfront, but I still get strange errors from the Cpp1 compiler, which I've narrowed down to this short repro:

// This is a complete file, that doesn't compile using MSVC.current or GCC 10 ?
#define CPP2_IMPORT_STD          Yes
#include "cpp2util.h"

If the macro is commented out, things compile fine. So there's something about the "import std" path that's going wrong, apparently on both MSVC and GCC 10.

I haven't been able to diagnose the problem further than that though.

That's pretty much all I have for the first review -- please let me know what you think. Thanks again!

Mar 20 '24 00:03 hsutter

Oh, one more thing: In the generated pure2-regex-partial.cpp, I noticed that the longest line is 397,231 characters long. I wonder if that line length would make any tools hiccup? (Thinking out loud: It should be doable to put some line breaks in there as needed, I can look into that.)

Mar 20 '24 00:03 hsutter

Odd. Now I see GCC 10 and Clang 12 work fine for me with that minimal repro, and I realized that at least part of it seems to be bugs in MSVC modules, that can't compile this:

import std.compat;
#include <string>   // via string_util.h

or this

#include <string>
import std.compat;

These are known problems. The solution is to not try to do both import and #include.

A group of questions, not particularly related:

Does string_util.h need to be a separate file, or can its contents go into cpp2util.h or even common.h?
Do programs that have been compiled with @regex need any features in string_util.h? I did some searching and it seems those are used only in regex.h2 itself, is that right?
Also, I noticed that I probably shouldn't have generated source/regex.h... I'm now thinking you're generating include/cpp2regex.h from /source/regex.h2, is that right?

Hold on changes please while I take another pass over this now... thanks!

Mar 21 '24 01:03 hsutter