regexp-tree icon indicating copy to clipboard operation
regexp-tree copied to clipboard

Incorrect quantifier merging in the optimizer

Open Aurele-Barriere opened this issue 11 months ago • 1 comments

The "quantifier merge transform" used by the optimizer is sometimes incorrect: a regex can be rewritten to another regex that is not equivalent.

  • It it sometimes incorrect to rewrite r{0,m}r{n} into r{n,m+n}.

Counter-example: /(?:a|ab){0,1}(?:a|ab){1}/ is not equivalent to /(?:a|ab){1,2}/: they match different things on the string "aba".

  • For the same reason, it is incorrect to rewrite r*r{n} into r{n,}.

Counter-example: /(?:a|ab)*(?:a|ab){1}/ is not equivalent to /(?:a|ab){1,}/: they match different things on the string "aba".

  • It is sometimes incorrect to rewrite r{0,n}?r{0,m}? into r{0,n+m}?.

Counter-example: /(?:ab|aba){0,1}?(?:ab|aba){0,1}?[bc]/ is not equivalent to /(?:ab|aba){0,2}?[bc]/: they match different things on the string "ababc".

However, regexp-tree's optimizer module performs these transformations.

Thanks to Eugène Flesselle @eugeneflesselle for finding these counter-examples while working on proving the correctness of regex transformations. We believe that these transformations are however correct when r is unambiguous, meaning that for each string, there is only one way of matching /^r/. Maybe regexp-tree could check the ambiguity of the regex before applying these transformations.

Aurele-Barriere avatar Dec 27 '24 14:12 Aurele-Barriere

@Aurele-Barriere thanks for reporting the issue (and @EugeneFlesselle for spotting it). Trying to keep balance between sophisticated and complicated optimizer which can handle all possible edge cases, and a simpler one. How widespread is this construct? It'd be great to fix it anyways though. I'll appreciate a pull request on this if you reach it earlier.

DmitrySoshnikov avatar Dec 28 '24 02:12 DmitrySoshnikov