Backreferences to non-participating groups are handled incorrectly
Issue extracted from #21.
E.g. ruby does this:
/\1()/.match?('foo') # => false
but the resulting js regex, with the same source, produces a zero-length match on any string.
A few other examples of regexes that include nonparticipating capturing groups:
-
(\1)- should fail to match -
(a)|b\1- should never match the second alternative -
(?<n>a)|(?<n>)\k<n>- the backreference should not allow matching an empty string as one of the multiplexed options (see #28)
Might include a nonparticipating capturing group:
-
(a)?b\1 -
((a)|b)+\2
Never includes a nonparticipating capturing group:
-
(a?)b\1
These cases are handled correctly now:
-
\1() -
(\1) -
(a)|b\1 -
(?<n>a)|(?<n>)\k<n> -
(a?)b\1
What remains to be solved are potential non-participating groups like (a)?b\1 and ((a)|b)+\2.
To quickly recapitulate the issue:
In Ruby/Onigmo/Oniguruma, regexps only match if every backreferenced group matches at least once:
/((a)|b)+\2/.match?('aa') # => true
/((a)|b)+\2/.match?('bb') # => false
/(a)?b\1/.match?('aba') # => true
/(a)?b\1/.match?('bbb') # => false
JS is more permissive and ignores backreferences to non-participating groups:
/((a)|b)+\2/.test('aa') // true
/((a)|b)+\2/.test('bb') // true
/(a)?b\1/.test('aba') // true
/(a)?b\1/.test('bbb') // true
Hence the challenge is to generate a JS regexp that is less permissive than a verbatim copy.
Maybe a lookahead in which the required part is enforced and in which all groups at any depth become passive could be used.
E.g. ((a)|(b)|c|d)+\3 could perhaps become this (whitespace only added for clarity):
(?:
(?=
(?:(?:a)|(?:b)|c|d)*
(?:(?:b))
(?:(?:a)|(?:b)|c|d)*
)
((a)|(b)|c|d)+\3
)
Things get more complicated if we consider that the backref itself could also be optional or on a branch, like in this example:
((a)|(b)|c|d)+(?:e|\3)
Another, and perhaps more consistent, option for such "combinatorial" cases could be to generate permutations of the whole pattern as this gem already does for conditionals.
As an aside, conditional branches are likely another situation that give rise to this issue, in addition to the cases involving alternation branches and quantifiers with a minimum below 1 illustrated above.