libfsm

Missing handling for non-escaped { } literals in pcre dialect

Open katef opened this issue 5 years ago • 3 comments

https://twitter.com/JakeDChampion/status/1282973512593018880

This case shows { and } near the beginning, and these are literal characters and not escaped.

/\s*(?:{(.*)})?\s*(?:(\$?\S+))?\s*(?:\[([^\]]*)])?\s*-?\s*([\S\s]*)\s*$/

I suppose the first is distinguished from the x{m,n} repetition syntax because it doesn't follow an atom. And then I guess the second is seen as non-special because by that point we're not in the middle of a {...} lexical region.

libre currently gives a syntax error here, but pcregrep accepts this.

katef, Jul 14 '20 18:07

Here's where the spec states that this is okay: http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC17

An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.

I believe this means that a{foo,6} is treated the same as a\{foo,6\}, because {foo,6} "does not match the syntax of a quantifier".
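To make the quoted rule concrete, here is a minimal sketch in Python. It is not libfsm's lexer: the name `quantifier_end` is hypothetical, and it only covers the `{n}`, `{n,}` and `{n,m}` forms implied by the passage quoted above. It does not check whether the brace follows an atom.

```python
import re

# Forms treated as a quantifier under the rule quoted above:
# {n}, {n,} and {n,m}, with decimal digits only.
_QUANTIFIER = re.compile(r"\{[0-9]+(,[0-9]*)?\}")

def quantifier_end(pattern, i):
    """Return the index just past a quantifier starting at pattern[i],
    or None if the '{' there should be taken as a literal character."""
    m = _QUANTIFIER.match(pattern, i)
    return m.end() if m else None
```

Under this check, the brace in a{2,3} is a quantifier, while the braces in a{foo,6} and {,6} fall through to the literal case.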

sfstewman, Jul 14 '20 19:07

Perhaps we can use SID's exception-handling alt for this. Normally we'd raise an error there for an invalid count production, but there's no reason ## has to be used to raise an error. I think we can have it produce a concatenation of literals instead.
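A rough sketch of that idea, written as a hand-rolled parser in Python rather than as a SID grammar. All names here (`CountError`, `parse_brace`, the tuple node shapes) are hypothetical, and the `except` branch only stands in for what a ## handler might do. It assumes no backtracking: whatever was consumed before the failure is emitted as literals, and parsing resumes at the failure point.

```python
DIGITS = "0123456789"

class CountError(Exception):
    """Raised where the input stops looking like a {m,n} count."""
    def __init__(self, pos):
        super().__init__(pos)
        self.pos = pos

def _digits(s, j):
    """Consume decimal digits from s[j:]; return (value or None, next index)."""
    k = j
    while k < len(s) and s[k] in DIGITS:
        k += 1
    return (int(s[j:k]) if k > j else None), k

def _parse_count(s, i):
    """Strict parse of {m}, {m,} or {m,n}, starting at s[i] == '{'."""
    lo, j = _digits(s, i + 1)
    if lo is None:
        raise CountError(j)
    hi = lo
    if j < len(s) and s[j] == ",":
        hi, j = _digits(s, j + 1)  # hi is None for the open-ended {m,}
    if j >= len(s) or s[j] != "}":
        raise CountError(j)
    return ("count", lo, hi), j + 1

def parse_brace(s, i):
    """Return (node, next_index) for the '{' at s[i]."""
    try:
        return _parse_count(s, i)
    except CountError as e:
        # Stand-in for the exception-handling alternative: no syntax error,
        # just the already-consumed characters as a concatenation of literals.
        return ("concat", [("literal", c) for c in s[i:e.pos]]), e.pos
```

With this sketch, {foo,6} yields a single literal { and resumes at f, while {12,x} yields literals for {12, and resumes at x.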

katef, Jul 18 '20 18:07

Possibly. It's interesting to consider forcing sid into a backtracking parser.

sfstewman, Jul 18 '20 19:07