regexp2 icon indicating copy to clipboard operation
regexp2 copied to clipboard

Improved compatibility for named capturing groups for ECMA mode

Open dop251 opened this issue 7 months ago • 12 comments

Hi,

This PR introduces a few changes to improve compatibility with the current ECMAScript specification in terms of handling of named capture groups:

  • The name of the group must be identifier-like, inlcuding Unicode characters (https://tc39.es/ecma262/#sec-static-semantics-capturinggroupname).
  • The group name may contain Unicode escaped characters (i.e. \uXXXX or \u{....}).
  • Named groups are assigned numbers in the order of appearance from left to right (i.e. in (.)(?<a>.)(.), a will be the group number 2).
  • Duplicate group names are not allowed.
  • Unnamed groups are not automatically given names that are decimal representations of the group numbers (i.e. their group.Name will be empty).

The changes only affect ECMAScript mode. i'm not 100% sure about the correctness of the group numbering changes, so I would appreciate a careful review.

Thanks!

dop251 avatar Apr 22 '25 22:04 dop251

I agree it would be good, unfortunately it's no trivial task, it would mean going through them one by one and manually translating each...

dop251 avatar Apr 23 '25 20:04 dop251

FWIW, these tests are part of goja tests, and most of them pass (the ones not explicitly excluded).

dop251 avatar Apr 23 '25 20:04 dop251

@dop251 the goja tests pass with the current regexp2 code or after this PR? Can we include more of the test262 regexp tests in goja with this PR to test it?

Just doing a pass through I think it'd be good to add these test patterns to regexp2 with their expected groups/names checks:

assert.compareArray(/(?<x>b)|(?<x>a)/.exec("bab"), ["b", "b", undefined]);
assert.compareArray(/(?:(?<x>a)|(?<x>b))\k<x>/.exec("aa"), ["aa", "a", undefined]);
assert.compareArray(["a", "a"], "bab".match(/(?<$>a)/));
assert.compareArray(["a", "a"], "bab".match(/(?<_>a)/));
assert.throws(SyntaxError, function() {
    return new RegExp("(?<🦊>fox)", "u");
});

Specifically looking at repeated names in alternations, backreferences with dupe names, and valid and invalid funky group names.

dlclark avatar Apr 24 '25 02:04 dlclark

Repeated names in alternations is a relatively new feature (https://github.com/tc39/proposal-duplicate-named-capturing-groups) and is not implemented in this PR. If you know a way to do this, please let me know.

As for the valid and invalid characters, I'll have a look an try to add some of the tests.

dop251 avatar Apr 25 '25 21:04 dop251

Repeated names in alternations works today in regexp2 (with and without the EcmaScript option), so I want to preserve backward compatibility with existing code and patterns out there.

dlclark avatar Apr 25 '25 22:04 dlclark

The problem is it works regardless of whether they are in different alternatives or not, i.e. (?<a>a)(?<a>a) works, but it shouldn't until the proposal I've mentioned is accepted. In the tc39 tests it's marked with a specific feature flag (regexp-duplicate-named-groups).

The ideal case of course would be implementing this feature correctly, however I couldn't find an easy way to do that.

Moreover, if I just remove the restriction the group numbering will break (see https://github.com/dlclark/regexp2/pull/96/files#diff-c4ec48a56dcc27cb25a6d76ceaef555b17e75c7c398c5e31b73b29c462b3d908R973)

dop251 avatar Apr 25 '25 23:04 dop251

(?a)(?a) works, but it shouldn't until the proposal I've mentioned is accepted.

I believe though that this example should not work even after the proposal is accepted, because there a is declared twice in the same alternative.

So I think there's these problems:

  • (?<a>a)(?<a>a) works now in regexp2 w/ ECMAScript flag, but it shouldn't, even after the proposal is accepted.
  • (?<a>a)|(?<a>a) works now in regexp2 w/ ECMAScript flag, but it shouldn't.
  • (?<a>a)|(?<a>a) will need to work in regexp2 w/ECMASCript flag, once the proposal is accepted.

I was thinking that if we are restricted by backward compatibility, we could create two new flags for ECMAScript, e.g.:

  • ECMAScriptBasic
  • ECMAScriptExtended

ECMAScriptExtended would include everything from Basic. We would then create an alias for ECMAScriptBasic called just ECMAScript, to maintain older code working:

const (
	IgnoreCase              RegexOptions = 0x0001 // "i"
	Multiline                            = 0x0002 // "m"
	ExplicitCapture                      = 0x0004 // "n"
	Compiled                             = 0x0008 // "c"
	Singleline                           = 0x0010 // "s"
	IgnorePatternWhitespace              = 0x0020 // "x"
	RightToLeft                          = 0x0040 // "r"
	Debug                                = 0x0080 // "d"
	ECMAScriptBasic                      = 0x0100 // "e"
	RE2                                  = 0x0200 // RE2 compat mode
	Unicode                              = 0x0400 // "u"
	ECMAScriptExtended                   = 0x0800
	ECMAScript                           = ECMAScriptBasic
)

Then we would have more freedom to make changes for ECMAScriptExtended.

federicotdn avatar Apr 29 '25 10:04 federicotdn

I agree with the concept of a new Option for a different type of EcmaScript that is documented something along the lines of: “stricter adherence to the EcmaScript regex standard, including removal of features that regexp2 supports. Prioritizes EcmaScript standard adherence over compatibility changes and future changes may break backward compatibility if they are in line with the EcmaScript standard.”

That way developers can opt-in to future possible breaking changes like this.

We would also need to document the differences between these two EcmaScript-style modes.

EcmaScriptBasic and EcmaScriptStrict?

dlclark avatar Apr 29 '25 16:04 dlclark

I think this will just cause confusion. There is only one ECMAScript standard and making sure that all previous regexes continue to work should not be an absolute goal, in my opinion. If they used to work due to a bug or a missing feature I think it's ok if they break. People can always keep an older version of regexp2 until they fix their regexes.

dop251 avatar Apr 29 '25 18:04 dop251

I agree with the concept of a new Option for a different type of EcmaScript that is documented something along the lines of: “stricter adherence to the EcmaScript regex standard, including removal of features that regexp2 supports. Prioritizes EcmaScript standard adherence over compatibility changes and future changes may break backward compatibility if they are in line with the EcmaScript standard.”

That would be great 👍🏻

EcmaScriptBasic and EcmaScriptStrict?

I thought about using "strict" as well but then I realized it will name-clash with https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Strict_mode, and I don't believe strict mode in JS changes anything regexp-related, so it would be a bit confusing.

federicotdn avatar Apr 29 '25 18:04 federicotdn

It will also create confusion as to which version should which change go into. For example, group names starting with numbers were not allowed in ECMAScript from the very start. Should it then go into ECMAScriptBasic or ECMAScriptExtended? If the latter, then ECMAScriptBasic isn't actually ECMAScript-conformant.

dop251 avatar Apr 29 '25 18:04 dop251

The reason for the separate Option isn’t because of upcoming EcmaScript changes. It’s purely about change management of the regexp2 package.

The current EcmaScript option was not designed to be fully compatible with EcmaScript, it is more compatible based on specific behaviors (some char classes, octal parsing, and some back reference stuff), but it’s not compliant. It’s a mish-mash of features from EcmaScript regex and C# regex and was inherited from the original C# version of the engine. A flag that means “strict adherence to EcmaScript” is a new concept that I’m willing to add, but I’m not willing to re-purpose the existing flag to mean this.

So in this example group names not allowing numbers would go into Strict mode.

Also, I agree that naming it Strict could be confusing in this context—maybe something like EcmaScriptCompliant is more accurate for the focus of the new mode anyway.

dlclark avatar Apr 29 '25 19:04 dlclark