regexp-tree
regexp-tree copied to clipboard
Backreferences appearing before their capture groups incorrectly treated as decimal escapes
The ES6 spec says that an expression like \5
is a backreference "only if the integer value of DecimalEscape is <= NcapturingParens", referring to the total number of capturing groups in the entire expression. However regexp-tree only compares to the number of capture groups encountered by the time the backreference is parsed.
See 15.10.2.11_A1_T5 from test262:
/\1(A)/.exec("AA");
This is expected to match the first "A". The backreference is to a capture group that has not captured anything, which unconditionally succeeds.
Example:
./bin/regexp-tree -e '/\1(a)/'
reports:
{
"type": "RegExp",
"body": {
"type": "Alternative",
"expressions": [
{
"type": "Char",
"value": "\\1",
"kind": "decimal",
"symbol": "\u0001"
},
{
"type": "Group",
"capturing": true,
"number": 1,
"expression": {
"type": "Char",
"value": "a",
"kind": "simple"
}
}
]
},
"flags": ""
}
The first expression should be a backreference, not a decimal.
Yeah, confirm, thanks for catching this; though, it's a rare edge-case, and in fact is a "forwardreference". Need to fix though.
Similarly, named groups (\k<x>
should be a "back/forward"reference):
/\k<x>(?<x>y)/
We can fix this by tracking a map from a capturing group number to an a parsed node (either a char, or a backreference). Then at the end, update the node types, if more capturing groups where found.
It is harder with named groups though, since \k<x>
is parsed right away as a set of char \k
, <
, x
, and >
. So post-update will have to remove those chars, and group them into one backreference node.
We can fix only number references for now (for not to complicate parser with named references reinterpretation), since the use-case is really rare.
Waiting for the fix.
This could be fixed with a two-pass approach:
- a quick pass 1 that only collects group numbers and names, and
- a full pass 2 that starts with knowledge about all available group names and numbers found in pass 1.
I can't implement this, but I can give an analysis:
- Changes to make in
parser/regexp.bnf
:- On this line: make use of the third argument
options
that can be given to the functionyyparse.onParseBegin()
(cf. this line in the generated parser). - A few lines later: accept an option
{ fixedGroups: { capturingGroupsCount, namedGroups } }
, and assign that data to the relevant variables. - If such an
options.fixedGroups
is given, then also set a new variableuseFixedGroups
totrue
; - after declaring that new variable a few lines earlier, outside
yyparse.onParseBegin = (...
. - On this line with
++capturingGroupsCount
: add a conditionif (!useFixedGroups)
. - On this line that checks for duplicate group names: add the same condition so it doesn't throw the
SyntaxError
.
- On this line: make use of the third argument
- Changes to make in
parser/index.js
:- In the function
regexpTreeParser.parse = ...
(starting on this line): make two calls togeneratedParseFn()
: - Make the first call as is done now.
- If the above
SyntaxError
is thrown, then abort, e.g. by simply not catching it. - Analyze the result to find
capturingGroupsCount
and allnamedGroups
. - Make the second call with
{ ...options, fixedGroups: { capturingGroupsCount, namedGroups } }
as second argument, and return the result.
- In the function