libfsm
Missing handling for non-escaped { } literals in pcre dialect
https://twitter.com/JakeDChampion/status/1282973512593018880
This pattern has { and } near the beginning; both are literal characters, and neither is escaped.
/\s*(?:{(.*)})?\s*(?:(\$?\S+))?\s*(?:\[([^\]]*)])?\s*-?\s*([\S\s]*)\s*$/
I suppose the first is distinguished from the x{m,n} repetition syntax because it doesn't follow an atom. The second is then presumably seen as non-special because by that point we're not in the middle of a {...} lexical region.
libre currently gives a syntax error here, but pcregrep accepts this.
Here's where the spec states that this is okay: http://www.pcre.org/current/doc/html/pcre2pattern.html#SEC17
An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.
I believe this means that a{foo,6} is treated the same as a\{foo,6\} because {foo,6} "does not match the syntax of a quantifier."
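To make the rule concrete, here is a minimal sketch in Python (not libfsm code) of the "does it match the syntax of a quantifier" check. It assumes the quantifier forms are exactly {n}, {n,} and {n,m}, as the quoted passage implies. The name `quantifier_at` is hypothetical, and the sketch covers only the syntax half of the rule, not whether the brace sits where a quantifier is allowed.

```python
import re

# Assumed quantifier grammar, taken from the quoted passage: {n}, {n,} or {n,m}.
_QUANTIFIER = re.compile(r"\{(\d+)(?:(,)(\d*))?\}")


def quantifier_at(pattern, i):
    """Return (min, max, end) if a quantifier starts at pattern[i], else None.

    max is None for the open-ended {n,} form; end is the index just past '}'.
    """
    m = _QUANTIFIER.match(pattern, i)
    if m is None:
        return None  # the caller would treat '{' as a literal character
    lo = int(m.group(1))
    if m.group(2) is None:      # {n}
        hi = lo
    elif m.group(3) == "":      # {n,}
        hi = None
    else:                       # {n,m}
        hi = int(m.group(3))
    return lo, hi, m.end()
```

Under these assumptions, `quantifier_at("a{foo,6}", 1)` and `quantifier_at("{,6}", 0)` both return `None`, so the brace would be read as a literal, while `quantifier_at("a{2,6}", 1)` yields a count.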
Perhaps we can use SID's exception-handling alt for this. Normally we'd raise an error for an invalid count production, but there's no reason ## has to be used to raise an error; I think we can have it produce a concatenation of literals instead.
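The shape of that recovery can be sketched outside SID. The Python below is only an illustration of the idea, not SID grammar syntax or libfsm's parser; `ParseError`, `parse_count` and `parse_count_or_literal` are hypothetical names. The handler stands in for the ## alternative: rather than reporting an error, it emits the `{` as a literal and lets the caller resume at the next character, so the rest of the would-be count is read as ordinary literals.

```python
DIGITS = "0123456789"


class ParseError(Exception):
    pass


def parse_count(s, i):
    """Parse {n}, {n,} or {n,m} starting at s[i]; raise ParseError otherwise."""
    def number(j):
        k = j
        while k < len(s) and s[k] in DIGITS:
            k += 1
        if k == j:
            raise ParseError("expected digit")
        return int(s[j:k]), k

    if i >= len(s) or s[i] != "{":
        raise ParseError("expected '{'")
    lo, j = number(i + 1)
    hi = lo
    if j < len(s) and s[j] == ",":
        j += 1
        if j < len(s) and s[j] in DIGITS:
            hi, j = number(j)
        else:
            hi = None  # open-ended {n,}
    if j >= len(s) or s[j] != "}":
        raise ParseError("expected '}'")
    return ("count", lo, hi), j + 1


def parse_count_or_literal(s, i):
    """Stand-in for the exception alt: recover by emitting '{' as a literal."""
    try:
        return parse_count(s, i)
    except ParseError:
        return ("literal", s[i]), i + 1
```

Note that the handler rewinds to just after the `{`, which this sketch gets for free by indexing into a string; a generated parser reading a token stream would need some way to replay the characters already consumed.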
Possibly. It's interesting to consider forcing sid into a backtracking parser.