regexp-tree icon indicating copy to clipboard operation
regexp-tree copied to clipboard

Parser: fix Octal escape such as \75 to be treated as an Octal number, not Decimal

Open DmitrySoshnikov opened this issue 7 years ago • 11 comments

/\75/

Should parse into:

{
  "type": "Char",
  "value": "\\75",
  "kind": "oct",
  "symbol": "=",
}

And not to:

{
  "type": "Char",
  "value": "\\75",
  "kind": "decimal",
  "symbol": "K",
}

See LegacyOctalEscapeSequence in Annex B: http://www.ecma-international.org/ecma-262/7.0/#sec-additional-syntax-string-literals

DmitrySoshnikov avatar May 11 '17 00:05 DmitrySoshnikov

The issue is even deeper. Only \0 should be treated as "Decimal", other escapes are Octal: \123, \1, \7.

However, \8, and \9 are simple chars (8, and 9 escaped, not codes).

\723 - Octal code \823 - chars \8, 2, 3

DmitrySoshnikov avatar May 11 '17 05:05 DmitrySoshnikov

I'm not sure the semantics are important here but everywhere they are mentioned as "decimal". The only exception is \0 which causes a NUL. When processed the DecimalEscape turns the literal into a number according to "The definition of “the MV of DecimalIntegerLiteral” is in 11.8.3.".

Note that in this interpretation the value cannot be an octal escape since the CFG explicitly allows you to escape \0 but not any of the other digits after it. In other words; there is no match for \04 or something like that because that is explicitly a syntax error due to DecimalEscape's definition DecimalIntegerLiteral [lookahead ∉ DecimalDigit].

Furthermore there's a note in 21.2.2.11 that says:

If \ is followed by a decimal number n whose first digit is not 0, then the escape sequence is considered to be a backreference. It is an error if n is greater than the total number of left capturing parentheses in the entire regular expression. \0 represents the <NUL> character and cannot be followed by a decimal digit.

I don't believe annex B applies to regular expressions and I don't think there's any way this can be interpreted as octals specwise. It's decimal integers only.

pvdz avatar May 11 '17 07:05 pvdz

@qfox, though the semantics is clearly exposed in today's regexp engines:

/\75/.test('='); // true
/\75/.test('K'); // false

regexp-tree currently parses it as K (should be =).

Annex B affects RegExp, see e.g.

[~U] DecimalEscape but only if the integer value of DecimalEscape is <= NcapturingParens

To manage the group backreferences:

/\2/ // valid, even if no capturing group, it's an Octal code 2

DmitrySoshnikov avatar May 11 '17 07:05 DmitrySoshnikov

And with Unicode u flag this is even a different semantics. Nodes gives me

/\75/u

// SyntaxError: Invalid regular expression: /\75/: Invalid escape

CC @mathiasbynens

DmitrySoshnikov avatar May 11 '17 07:05 DmitrySoshnikov

In the semantics;

The production ClassEscape::DecimalEscape but only if … evaluates as follows:

Evaluate DecimalEscape to obtain an EscapeValue E.
Assert: E is a character.
Let ch be E's character.
Return the one-element CharSet containing the character ch.

So the value is still interpreted according to the ("normal") DecimalEscape evaluation rules. Never as octal.

In char class escapes the escape may only be \0. I'm not entirely sure what the point of that extra rule is tbh.

And with Unicode u flag this is even a different semantics. Nodes gives me

That's because the cfg rule starts with [~U]. That means this exception only applies without the u flag. But I think you knew that. So this whole thing doesn't even apply in u mode.

pvdz avatar May 11 '17 07:05 pvdz

@DmitrySoshnikov do you agree with the above interpretation of the spec? At this point I'm pretty confident that the escape is decimal and never octal. If it was \0 it'll cause a syntax error if it is followed by another digit. Otherwise it contributes NUL to the regex. In all other cases (\1 ... \9) it'll parse into a back reference. You'll need to verify that the resulting index doesn't exceed the total number of capturing groups. Which you can only do afterwards.

I just remember the highest back reference index and count the number of groups. After the body completes I'll check whether the largest back reference is lower or equal to the number of groups.

Note that the back reference can refer to a forward group. At runtime that group will probably just be undefined. It's irrelevant at parse time.

pvdz avatar May 12 '17 07:05 pvdz

@qfox, well, the semantics of the following, still says it's Octal because of legacy octal numbers:

/[\75]/.test('='); // true, decimal 61 code, not 75 (which would be 'K')
/[\75]/.test('K'); // false

DmitrySoshnikov avatar May 12 '17 18:05 DmitrySoshnikov

Ok let's trace that.

In unit mode; we start in A until annex B overrides something

/\015/

A.8
CharacterClass :: [ [lookahead ∉ { ^ }] ClassRanges [?U] ]
ClassRanges [U] :: NonemptyClassRanges [?U]
ClassAtom [?U] :: ClassAtomNoDash [?U]
ClassAtomNoDash [U] :: \ ClassEscape [?U]
B.1.4 Regular Expressions Patterns
ClassEscape [U] :: [~U] CharacterEscape  (because the other rules don't apply)
CharacterEscape [U] :: [~U] LegacyOctalEscapeSequence
=> okay as long as the value is max 255 and otherwise;

CharacterEscape [U] :: IdentityEscape [?U]
IdentityEscape [U] :: [~U] SourceCharacter but not c

So okay, as long as you support annex B you can parse that as octal and even as decimal if that turns out impossible (keep in mind that the octal can not exceed 255, although why would you even want that).

However, none of those paths lead anywhere for /\015/u and the cfg will fizzle out leading to an error. The octal and decimal path explicitly does not allow the unicode flag.

pvdz avatar May 12 '17 19:05 pvdz

Yeah, with u flag it's an invalid encoding. I'm not sure whether regexp-tree needs to actively and fully support Annex B (after all it's a depurated legacy stuff). Probably only for interpreter module, which still should interpret /\75/.test('='); as true. Not sure if it's even essential for the parser now. But for the parser, yes, it goes to LegacyOctalEscapeSequence.

DmitrySoshnikov avatar May 12 '17 23:05 DmitrySoshnikov

Heh, I like "depurated" :)

I think you (and eventually I) have no choice here. Annex B is what browsers do with legacy code and for a long long time our tools will want to be able to process legacy code, even if it's brand new.

It's even worse for regular expressions since I don't think there's an auto-opt-in like strict mode is for module code.

pvdz avatar May 13 '17 08:05 pvdz

@qfox, ah, autocorrect for "depurated", but that's right :) Yeah, I think we still need to fix and treat \75 and other as Octal, and match = in the interpreter.

DmitrySoshnikov avatar May 14 '17 06:05 DmitrySoshnikov