regexp-tree icon indicating copy to clipboard operation
regexp-tree copied to clipboard

Backreferences appearing before their capture groups incorrectly treated as decimal escapes

Open ridiculousfish opened this issue 7 years ago • 3 comments

The ES6 spec says that an expression like \5 is a backreference "only if the integer value of DecimalEscape is <= NcapturingParens", referring to the total number of capturing groups in the entire expression. However regexp-tree only compares to the number of capture groups encountered by the time the backreference is parsed.

See 15.10.2.11_A1_T5 from test262:

/\1(A)/.exec("AA");

This is expected to match the first "A". The backreference is to a capture group that has not captured anything, which unconditionally succeeds.

Example:

./bin/regexp-tree -e '/\1(a)/'

reports:

{
  "type": "RegExp",
  "body": {
    "type": "Alternative",
    "expressions": [
      {
        "type": "Char",
        "value": "\\1",
        "kind": "decimal",
        "symbol": "\u0001"
      },
      {
        "type": "Group",
        "capturing": true,
        "number": 1,
        "expression": {
          "type": "Char",
          "value": "a",
          "kind": "simple"
        }
      }
    ]
  },
  "flags": ""
}

The first expression should be a backreference, not a decimal.

ridiculousfish avatar May 03 '17 02:05 ridiculousfish

Yeah, confirm, thanks for catching this; though, it's a rare edge-case, and in fact is a "forwardreference". Need to fix though.

Similarly, named groups (\k<x> should be a "back/forward"reference):

/\k<x>(?<x>y)/

We can fix this by tracking a map from a capturing group number to an a parsed node (either a char, or a backreference). Then at the end, update the node types, if more capturing groups where found.

It is harder with named groups though, since \k<x> is parsed right away as a set of char \k, <, x, and >. So post-update will have to remove those chars, and group them into one backreference node.

We can fix only number references for now (for not to complicate parser with named references reinterpretation), since the use-case is really rare.

DmitrySoshnikov avatar May 03 '17 03:05 DmitrySoshnikov

Waiting for the fix.

MangaD avatar Jul 12 '21 08:07 MangaD

This could be fixed with a two-pass approach:

  • a quick pass 1 that only collects group numbers and names, and
  • a full pass 2 that starts with knowledge about all available group names and numbers found in pass 1.


I can't implement this, but I can give an analysis:

  • Changes to make in parser/regexp.bnf:
    • On this line: make use of the third argument options that can be given to the function yyparse.onParseBegin() (cf. this line in the generated parser).
    • A few lines later: accept an option { fixedGroups: { capturingGroupsCount, namedGroups } }, and assign that data to the relevant variables.
    • If such an options.fixedGroups is given, then also set a new variable useFixedGroups to true;
    • after declaring that new variable a few lines earlier, outside yyparse.onParseBegin = (....
    • On this line with ++capturingGroupsCount: add a condition if (!useFixedGroups).
    • On this line that checks for duplicate group names: add the same condition so it doesn't throw the SyntaxError.
  • Changes to make in parser/index.js:
    • In the function regexpTreeParser.parse = ... (starting on this line): make two calls to generatedParseFn():
    • Make the first call as is done now.
    • If the above SyntaxError is thrown, then abort, e.g. by simply not catching it.
    • Analyze the result to find capturingGroupsCount and all namedGroups.
    • Make the second call with { ...options, fixedGroups: { capturingGroupsCount, namedGroups } } as second argument, and return the result.

datamoon avatar Apr 21 '22 20:04 datamoon