Incorrect quantifier merging in the optimizer
The "quantifier merge transform" used by the optimizer is sometimes incorrect: a regex can be rewritten to another regex that is not equivalent.
- It it sometimes incorrect to rewrite
r{0,m}r{n}intor{n,m+n}.
Counter-example:
/(?:a|ab){0,1}(?:a|ab){1}/ is not equivalent to /(?:a|ab){1,2}/: they match different things on the string "aba".
- For the same reason, it is incorrect to rewrite
r*r{n}intor{n,}.
Counter-example:
/(?:a|ab)*(?:a|ab){1}/ is not equivalent to /(?:a|ab){1,}/: they match different things on the string "aba".
- It is sometimes incorrect to rewrite
r{0,n}?r{0,m}?intor{0,n+m}?.
Counter-example:
/(?:ab|aba){0,1}?(?:ab|aba){0,1}?[bc]/ is not equivalent to /(?:ab|aba){0,2}?[bc]/: they match different things on the string "ababc".
However, regexp-tree's optimizer module performs these transformations.
Thanks to Eugène Flesselle @eugeneflesselle for finding these counter-examples while working on proving the correctness of regex transformations.
We believe that these transformations are however correct when r is unambiguous, meaning that for each string, there is only one way of matching /^r/.
Maybe regexp-tree could check the ambiguity of the regex before applying these transformations.
@Aurele-Barriere thanks for reporting the issue (and @EugeneFlesselle for spotting it). Trying to keep balance between sophisticated and complicated optimizer which can handle all possible edge cases, and a simpler one. How widespread is this construct? It'd be great to fix it anyways though. I'll appreciate a pull request on this if you reach it earlier.