regexpp
regexpp copied to clipboard
Bug: Named backreferences will always cause a syntax error for non-Unicode regexes in strict parsing mode
When parsing a non-Unicode regex that contains named backreferences with the strict: true
option, a syntax error will always be throws regardless of whether the regex is actually correct or not.
Example:
const { RegExpValidator } = require("regexpp")
const validator = new RegExpValidator({ strict: true, ecmaVersion: 2020 })
validator.validatePattern(/(?<foo>A)\k<foo>/.source, undefined, undefined, false)
This produces the following error:
SyntaxError: Invalid regular expression: /(?<foo>A)\k<foo>/: Invalid escape
at RegExpValidator.raise ([...]\regexpp\.temp\src\validator.ts:847:15)
at RegExpValidator.consumeAtomEscape ([...]\regexpp\.temp\src\validator.ts:1475:18)
at RegExpValidator.consumeReverseSolidusAtomEscape ([...]\regexpp\.temp\src\validator.ts:1245:22)
at RegExpValidator.consumeAtom ([...]\regexpp\.temp\src\validator.ts:1213:18)
at RegExpValidator.consumeTerm ([...]\regexpp\.temp\src\validator.ts:1027:23)
at RegExpValidator.consumeAlternative ([...]\regexpp\.temp\src\validator.ts:1000:53)
at RegExpValidator.consumeDisjunction ([...]\regexpp\.temp\src\validator.ts:976:18)
at RegExpValidator.consumePattern ([...]\regexpp\.temp\src\validator.ts:901:14)
at RegExpValidator.validatePattern ([...]\regexpp\.temp\src\validator.ts:531:14)
at validateRegExpPattern (my-project\app.ts:12:75)
However, the regex /(?<foo>A)\k<foo>/
is valid. As stated in the proposal:
In this proposal,
\k<foo>
in non-Unicode RegExps will continue to match the literal string"k<foo>"
unless the RegExp contains a named group, in which case it will match that group or be a syntax error, depending on whether or not the RegExp has a named group namedfoo
.
Since the regex contains a named capturing group, \k<foo>
has to be parsed as a backreference. Since Annex B doesn't say anything about named backreferences, regexpp should parse this regex even with strict: true
.
However, regexpp parses it as an invalid(?) escape and throws an error in strict mode. This is because validation is done is two passes (1, 2). The bug occurs because the n
flag isn't set in the first pass causing the syntax error. This can be seen in the stack trace: the second-last line - at RegExpValidator.validatePattern ([...]\validator.ts:531:14)
- is the first parsing pass.
The fix for this bug is to determine whether the regex contains named groups ahead of time, similar to how the number of capturing groups is counted before parsing. I will make a PR.
Hi. Thank you for your detailed report!
Hmm. I'm not 100% sure if it's a spec violation. The two passes parsing came from 22.2.3.2.3 Static Semantics: ParsePattern ( patternText, u ). It says, "if no u
flag, parse the pattern without N
parameter, then if no syntax errors exist and named capturing groups exist, re-parse the pattern with N
parameter." And unfortunately, since 22.2.1 Patterns, \k
is a syntax error if no N
parameter.
Please tell me if I went wrong.
So maybe, this is a spec bug, and IdentityEscape production needs [~N] k
.
I agree with you. I also read the specification like this. \k<foo>
is a syntax error according to the spec. The only reason v8 parses the above regex is because of Annex B.
This is quite interesting because whether this is a bug or not comes down to the semantic of the strict
option. Right now strict
is implemented as "Annex B syntax is disabled for every parsing pass". In #24, I implemented strict
as "Annex B syntax is disabled for the last parsing pass" (well, not quite).
As a question to the creator of the library: What is the semantic of the strict
option?
I hope that it's the latter version because that would be a lot more useful IMO.
I opened an issue: https://github.com/tc39/ecma262/issues/2434
I don't think it's intentional that named capturing groups requires u
flag. Let's see how TC39 fixes that.
As a question to the creator of the library: What is the semantic of the strict option?
It disables Annex B completely.
For people watching this issue: we've already started with our own fork in order to not hold the wider community back anymore: https://github.com/eslint-community/regexpp
@mysticatea We would still love to move the original repo to the new @eslint-community though.
This PR is released in @eslint-community/regexpp
v4.4.1
https://github.com/eslint-community/regexpp/releases/tag/v4.4.1