regexp-tree
\s optimization changes runtime semantics
As per the ECMAScript spec, `\s` matches any WhiteSpace or LineTerminator character.
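Currently the optimizer docs say it transforms `[ \t\r\n\f]` to `\s` (and the opposite), but this changes the runtime semantics of the RegExp, since `\s` in fact matches many more characters, including U+000B (VT), U+00A0 (NBSP), U+FEFF (ZWNBSP), U+2028, U+2029, and every other Unicode space separator (category Zs).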
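To make the difference concrete, here is a small sketch that compares the explicit list with the shorthand on a handful of characters. The sample set is my own selection from the spec's WhiteSpace and LineTerminator productions, not an exhaustive list.

```javascript
// The explicit list from the optimizer docs vs. the \s shorthand.
const explicit = /^[ \t\r\n\f]$/;
const shorthand = /^\s$/;

// A few characters that \s accepts (illustrative selection, not exhaustive).
const samples = {
  'VT (U+000B)': '\u000B',
  'NBSP (U+00A0)': '\u00A0',
  'OGHAM SPACE MARK (U+1680)': '\u1680',
  'EN QUAD (U+2000)': '\u2000',
  'LINE SEPARATOR (U+2028)': '\u2028',
  'PARAGRAPH SEPARATOR (U+2029)': '\u2029',
  'IDEOGRAPHIC SPACE (U+3000)': '\u3000',
  'ZWNBSP / BOM (U+FEFF)': '\uFEFF',
};

// Names of the characters where the two patterns disagree.
const diverging = Object.entries(samples)
  .filter(([, ch]) => shorthand.test(ch) && !explicit.test(ch))
  .map(([name]) => name);

console.log(diverging); // all eight sample names are listed
```

Any input containing one of these characters is matched by the "optimized" pattern but not by the original one.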
I don't know if this is an intentional deviation, with these treated as edge cases that can be ignored, but it would be nice to make such "loose" transformations at least optional behind a flag. Otherwise an optimized regexp can behave differently from the original one, which is dangerous when the optimizer is used in an ESLint plugin or a transpiler.
Good catch! In practice though, as you mentioned, I'd consider it an edge case. The "loose" mode sounds interesting, or we could just support a whitelist/blacklist of transforms and exclude a transform if it changes the semantics in practice.
I just wonder whether people ever write such an explicit list of whitespace characters but mean `\s`. Or maybe it's safer to just remove this particular optimisation and assume people wrote this list of characters because they intended to? (Up to you to decide, of course.)
Yeah, let’s not change runtime semantics.
OK, I need to think about it. Depending on how common the use case is, we can either exclude the `\s` transform by default (and allow opting in), or, if it's very rare, keep it and allow opting out when needed.
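As a rough illustration of the opt-in variant, the sketch below marks each transform as loose or safe and skips loose ones unless asked. Everything here is hypothetical: `transforms`, `optimizeSource`, and the `loose` and `blacklist` options are made-up names for this example, not regexp-tree's API, and plain string replacement stands in for real AST transforms.

```javascript
// Hypothetical transform registry (illustrative names, not regexp-tree's API).
// `loose: true` marks a transform that may change runtime semantics.
const transforms = [
  {
    name: 'digitRangeToMeta', // [0-9] and \d are equivalent in ECMAScript
    loose: false,
    apply: (src) => src.split('[0-9]').join('\\d'),
  },
  {
    name: 'whitespaceListToMeta', // \s matches more than [ \t\r\n\f]
    loose: true,
    apply: (src) => src.split('[ \\t\\r\\n\\f]').join('\\s'),
  },
];

// Applies safe transforms by default; loose ones only when `loose` is set.
// Any transform named in `blacklist` is skipped regardless.
function optimizeSource(source, { loose = false, blacklist = [] } = {}) {
  return transforms
    .filter((t) => (loose || !t.loose) && !blacklist.includes(t.name))
    .reduce((src, t) => t.apply(src), source);
}

const input = '[0-9]+[ \\t\\r\\n\\f]*';
console.log(optimizeSource(input));                  // \d+[ \t\r\n\f]*
console.log(optimizeSource(input, { loose: true })); // \d+\s*
```

With this shape, the default output stays semantically identical to the input, and the opt-out variant would only need the default of `loose` flipped.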
> or, if it's very rare, keep it and allow opting out when needed
I don't think keeping it as-is is a very good idea, especially if it's rarely used, since then it can silently bite much later at runtime.