Support extended POSIX regexes

Open Aloso opened this issue 3 years ago • 8 comments

Aloso avatar Mar 11 '22 15:03 Aloso

There are many devices that only support ERE. We need it.

wy16W2pIilK1xgqN avatar Sep 30 '22 08:09 wy16W2pIilK1xgqN

@wy16W2pIilK1xgqN could you explain? What devices are they?

Aloso avatar Oct 01 '22 14:10 Aloso

A lot: routers and firewalls. For example, all MikroTik devices.

wy16W2pIilK1xgqN avatar Oct 01 '22 15:10 wy16W2pIilK1xgqN

The problem is that ERE doesn't support non-capturing groups, which Pomsky emits for expressions like

("hello"? | "world"+) "!!"

which compiles to

(?:(?:hello)?|(?:world)+)!!

For ERE, this would have to compile to

((hello)?|(world)+)!!

But this is not equivalent, because it changes the capturing group indexes. So we either need an option to never emit non-capturing groups when compiling to ERE, or we need to make the above code illegal, requiring capturing groups like this:

:(:("hello")? | :("world")+) "!!"
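The index shift described above can be checked with a small sketch. It uses Python's `re` module, which is not an ERE engine but accepts both spellings, so it only illustrates the numbering problem. The `(a|b)(c)` pattern is a made-up example, not taken from Pomsky.

```python
import re

# Output for flavors that have non-capturing groups (from the comment above).
with_non_capturing = re.compile(r"(?:(?:hello)?|(?:world)+)!!")
# The same expression with every group made capturing, as ERE would require.
all_capturing = re.compile(r"((hello)?|(world)+)!!")

# Both accept the same text, but they expose different numbers of groups.
print(with_non_capturing.groups, all_capturing.groups)

# Hypothetical follow-on: the user only asked for one capturing group, around "c".
intended = re.compile(r"(?:a|b)(c)")
naive_ere = re.compile(r"(a|b)(c)")

# Group 1 refers to different text once the first group becomes capturing.
print(intended.match("ac").group(1))
print(naive_ere.match("ac").group(1))
```

Any code that reads group 1 from the match would silently get the wrong substring after the conversion, which is why the two patterns are not interchangeable.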

Although the outer capturing group could be avoided by "inlining" the exclamation marks into each alternative, so that

(:("hello")? | :("world")+) "!!"

compiles to

(hello)?!!|(world)+!!

But that could lead to an exponential increase in the size of the generated expression, so it's probably not a good idea.
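To get a feel for the size increase, here is a minimal sketch of the inlining idea. `inline_alternations` is a hypothetical helper written for this illustration, not part of Pomsky; it distributes a sequence of alternations into one flat alternation and compares lengths.

```python
from itertools import product

def inline_alternations(groups):
    """Flatten a sequence of alternations into a single top-level alternation.

    `groups` is a list of lists of already-compiled fragments; a plain
    literal is a group with one alternative.
    """
    return "|".join("".join(choice) for choice in product(*groups))

# The example from the comment above: two alternatives followed by "!!".
print(inline_alternations([["(hello)?", "(world)+"], ["!!"]]))

# n consecutive two-way alternations expand to 2**n flat alternatives.
for n in (2, 4, 8, 16):
    groups = [["a", "b"]] * n
    grouped_len = sum(len("(" + "|".join(g) + ")") for g in groups)
    flat_len = len(inline_alternations(groups))
    print(n, grouped_len, flat_len)
```

The grouped form grows linearly with the number of alternations, while the flattened form doubles with each one added.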

Aloso avatar Oct 01 '22 16:10 Aloso

The other problem is that ERE does not allow escaping characters within a character class, so characters need to be rearranged:

['^' 'a'-'z' '\' '-' ']']

will have to be compiled to

[]^a-z\-]

Rules:

  • The literal ^ can't appear at the start
  • The literal ] can only appear at the start (directly after the opening [, or after the ^ of a negated class)
  • The literal - can only appear at the start or end
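The three rules can be sketched as a small ordering function. `ere_bracket` is a hypothetical helper, not Pomsky's implementation. It assumes non-negated classes, range endpoints that are not themselves special, and it does not handle POSIX constructs such as `[.`, `[=` or `[:`.

```python
def ere_bracket(chars, ranges=()):
    """Order literals and (lo, hi) ranges into a POSIX bracket expression."""
    chars = set(chars)
    first = "]" if "]" in chars else ""   # ']' must come first
    last = "-" if "-" in chars else ""    # '-' goes last
    caret = "^" if "^" in chars else ""   # '^' must not come first
    rest = "".join(f"{lo}-{hi}" for lo, hi in ranges)
    rest += "".join(sorted(chars - set("]^-")))

    if caret and not (first or rest):
        if not last:
            # "[^]" is not a valid class containing only '^'.
            raise ValueError("a lone '^' cannot be written as a bracket expression")
        # Only '^' and '-' remain: lead with '-' so that '^' is not first.
        return "[-^]"
    if first:
        body = first + caret + rest + last
    else:
        body = rest + caret + last
    return "[" + body + "]"

# The class from the comment above: '^', 'a'-'z', '\', '-', ']'
print(ere_bracket(["^", "\\", "-", "]"], [("a", "z")]))
```

A lone `^` is rejected here because it has no bracket form; a real compiler would presumably emit it as an escaped literal outside brackets instead.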

Aloso avatar Oct 01 '22 16:10 Aloso

Another problem: Codepoint/C doesn't work (it compiles to [\s\S], which is not supported in ERE), so what are the alternatives?

  • Allow the dot instead (matches anything except line breaks by default; line breaks are included in single-line/dotall mode, which some engines such as Ruby call multiline mode)
  • Compile C to ., but that would change the behavior of the pomsky expression depending on the flavor; not good
  • Compile C to (.|\s), but that can lead to catastrophic backtracking; also, \s is supported by GNU ERE but not POSIX ERE; not good
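The behavioral difference between these alternatives can be seen in a quick check. This is a sketch using Python's `re` module as a stand-in, since it supports `[\s\S]`, the dot, and `\s`; it says nothing about what a particular ERE implementation accepts.

```python
import re

text = "a\nb"

# [\s\S] matches any code point, including the line break.
any_codepoint = re.fullmatch(r"a[\s\S]b", text) is not None
# The dot skips line breaks by default...
plain_dot = re.fullmatch(r"a.b", text) is not None
# ...unless the engine's dotall flag is set.
dotall_dot = re.fullmatch(r"a.b", text, re.DOTALL) is not None
# (.|\s) also covers the line break, at the cost of an alternation.
dot_or_space = re.fullmatch(r"a(.|\s)b", text) is not None

print(any_codepoint, plain_dot, dotall_dot, dot_or_space)
```

Compiling C to a bare dot would therefore change which inputs match unless the target engine is also put into the right mode.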

Aloso avatar Oct 01 '22 16:10 Aloso

The dot is now supported as of Pomsky 0.8. Rewriting the code for compiling character classes is in progress, with the goal of eventually supporting ERE. The only open question right now is how to handle non-capturing groups. Any input for this would be appreciated!

Possibilities are:

  1. disallow non-capturing groups when targeting ERE, requiring users to write :() instead

  2. add an option to silently convert non-capturing groups to capturing groups when targeting ERE; this could be made configurable, e.g. with -Xcapture=always

Both have disadvantages: option 1 makes Pomsky expressions less portable, while option 2 makes their behavior less predictable.
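To make the two options concrete, here is a minimal sketch over a toy syntax tree. None of these names come from Pomsky: `Literal`, `Group`, `emit_ere` and `compile_ere` are hypothetical, and the `capture` parameter only mirrors the proposed -Xcapture=always setting. Literal text is assumed to be already escaped for ERE.

```python
from dataclasses import dataclass

@dataclass
class Literal:
    text: str  # assumed to be already escaped for ERE

@dataclass
class Group:
    alternatives: list  # each alternative is a list of nodes
    capturing: bool
    repeat: str = ""    # "", "?", "+" or "*"

def emit_ere(node, capture="error"):
    """Emit ERE for one node; `capture` selects option 1 or option 2."""
    if isinstance(node, Literal):
        return node.text
    if not node.capturing and capture != "always":
        # Option 1: refuse, so the user has to write :() explicitly.
        raise ValueError("non-capturing group cannot be expressed in ERE")
    # Option 2 (capture="always"): emit the group as a capturing group.
    inner = "|".join(
        "".join(emit_ere(child, capture) for child in alt)
        for alt in node.alternatives
    )
    return "(" + inner + ")" + node.repeat

def compile_ere(nodes, capture="error"):
    return "".join(emit_ere(node, capture) for node in nodes)

# ("hello"? | "world"+) "!!" with every group non-capturing
expr = [
    Group([[Group([[Literal("hello")]], False, "?")],
           [Group([[Literal("world")]], False, "+")]], False),
    Literal("!!"),
]

print(compile_ere(expr, capture="always"))
try:
    compile_ere(expr)
except ValueError as err:
    print("rejected:", err)
```

With option 2 the output contains three capturing groups the user never wrote, which is the predictability concern; with option 1 the same source compiles for other flavors but fails for ERE, which is the portability concern.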

Aloso avatar Dec 28 '22 15:12 Aloso