Support `u` flag for JavaScript

Open danny0838 opened this issue 2 years ago • 11 comments

The code:

console.warn(JSON.stringify(Regex.Analyzer(/\u{20000}/u).tree(), null, 2))

throws an error because the \u{XXXXX} syntax is not supported when the u flag is used.

danny0838 avatar Nov 19 '22 08:11 danny0838

The update to 1.2.0 (JS only) addresses part of this issue, but I am not sure whether anything else is needed. Take a look. I am leaving this open.

foo123 avatar Nov 19 '22 13:11 foo123

/\u{2}/u seems to throw an error.

danny0838 avatar Nov 19 '22 14:11 danny0838

Something like /\p{Punctuation}/u needs to be implemented.

danny0838 avatar Nov 19 '22 14:11 danny0838

The value of char for /\u{20000}/u is not correct. It should be the UTF-16 surrogate pair \uD840\uDC00, which can be obtained from String.fromCodePoint(0x20000).

Browsers that support the unicode flag seem to support String.fromCodePoint as well. A polyfill may be required if this library is intended to work on a JavaScript engine that doesn't support it.
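To illustrate the surrogate pair point, here is a minimal sketch of a fallback, assuming the library wants to avoid a hard dependency on String.fromCodePoint. The helper name codePointToString is hypothetical and not part of RegexAnalyzer.

```javascript
// Hypothetical helper, not part of RegexAnalyzer: turn a code point into a string.
function codePointToString(codePoint) {
  // Use the native method when the engine provides it.
  if (typeof String.fromCodePoint === 'function') {
    return String.fromCodePoint(codePoint);
  }
  // Code points in the Basic Multilingual Plane fit in one UTF-16 code unit.
  if (codePoint <= 0xFFFF) {
    return String.fromCharCode(codePoint);
  }
  // Astral code points are split into a high and a low surrogate.
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);
  var low = 0xDC00 + (offset & 0x3FF);
  return String.fromCharCode(high, low);
}

var ch = codePointToString(0x20000);
console.log(ch === '\uD840\uDC00'); // true
console.log(ch.length);             // 2
```

The manual branch only runs on engines without String.fromCodePoint, so modern browsers keep using the native implementation.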

danny0838 avatar Nov 19 '22 15:11 danny0838

Regex.Analyzer(/\u{20000}/u).compile() should be /\u{20000}/u rather than /\u20000/u.
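The difference matters because the two forms do not match the same text. The following sketch uses only the native RegExp (not RegexAnalyzer) to show that, with the u flag, \u20000 is read as U+2000 followed by a literal 0.

```javascript
// Native RegExp only; nothing here calls RegexAnalyzer.
var astral = String.fromCodePoint(0x20000);

var braced = new RegExp('^\\u{20000}$', 'u');   // one code point, U+20000
var unbraced = new RegExp('^\\u20000$', 'u');   // U+2000 followed by "0"

console.log(braced.test(astral));      // true
console.log(unbraced.test(astral));    // false
console.log(unbraced.test('\u20000')); // true
```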

danny0838 avatar Nov 19 '22 15:11 danny0838

When the unicode flag is not set, a pattern like /\u{2}/ should be treated as a literal u followed by the quantifier {2}.

See doc for more syntax details.
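A quick check with the native RegExp shows the two interpretations side by side. This is only a sketch of the engine's behavior that the analyzer's tree is expected to mirror, and it assumes an engine that supports the u flag.

```javascript
// Same source text, with and without the unicode flag.
var withFlag = /^\u{2}$/u;   // the code point U+0002
var withoutFlag = /^\u{2}$/; // literal "u" repeated exactly twice

console.log(withFlag.test('\u0002'));    // true
console.log(withFlag.test('uu'));        // false
console.log(withoutFlag.test('uu'));     // true
console.log(withoutFlag.test('\u0002')); // false
```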

danny0838 avatar Nov 19 '22 17:11 danny0838

New upload of v1.2.0:

/\u{61}/u
{
  "type": 1,
  "val": [
    {
      "type": 32,
      "val": "u{61}",
      "flags": {
        "Char": "a",
        "Code": "61",
        "UnicodePoint": true
      },
      "typeName": "UnicodeChar"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}
/\u{61}/
{
  "type": 1,
  "val": [
    {
      "type": 16,
      "val": {
        "type": 1024,
        "val": "u",
        "flags": {},
        "typeName": "String"
      },
      "flags": {
        "val": "{61}",
        "MatchMinimum": "61",
        "MatchMaximum": "61",
        "min": 61,
        "max": 61,
        "StartRepeats": 1,
        "isGreedy": 1
      },
      "typeName": "Quantifier"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

When the unicode flag is not set, a pattern like /\u{2}/ should be treated as a literal u followed by the quantifier {2}.

Fixed

Regex.Analyzer(/\u{20000}/u).compile() should be /\u{20000}/u rather than /\u20000/u.

Fixed

The value of char for /\u{20000}/u is not correct. It should be the UTF-16 surrogate pair \uD840\uDC00, which can be obtained from String.fromCodePoint(0x20000).

Fixed

Something like /\p{Punctuation}/u needs to be implemented.

Only on a major update, not anytime soon

foo123 avatar Nov 19 '22 18:11 foo123

/\u{2}/u does not seem to be treated correctly as a unicode char.

danny0838 avatar Nov 19 '22 18:11 danny0838

The unicode flag also changes how incomplete escape sequences behave: a pattern like /\x/u, /\x3/u, /\u/u, or /\u30/u throws a SyntaxError instead of being accepted.

Also, a character class like /[\W-3]/u becomes invalid. (See doc for more syntax details.)
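The stricter rules can be checked against the native engine with a small probe. The helper name isValidPattern is hypothetical. It only wraps the RegExp constructor and says nothing about how RegexAnalyzer handles these inputs.

```javascript
// Hypothetical helper: report whether the native engine accepts a pattern.
function isValidPattern(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    // Invalid patterns raise a SyntaxError at construction time.
    return false;
  }
}

var samples = ['\\x', '\\x3', '\\u', '\\u30', '[\\W-3]'];
samples.forEach(function (source) {
  // Each sample is accepted without the u flag and rejected with it.
  console.log(source, isValidPattern(source, ''), isValidPattern(source, 'u'));
});
```

Without the u flag these patterns fall back to the legacy, lenient parsing, which is why they are accepted there.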

Not sure if you are going to implement it.

danny0838 avatar Nov 19 '22 18:11 danny0838

/\u{2}/u does not seem to be treated correctly as a unicode char.

Fixed

/\u{2}/u
"\\u{2}"
{
  "type": 1,
  "val": [
    {
      "type": 32,
      "val": "u{2}",
      "flags": {
        "Char": "\u0002",
        "Code": "2",
        "UnicodePoint": true
      },
      "typeName": "UnicodeChar"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}
/\u{61}/u
"\\u{61}"
{
  "type": 1,
  "val": [
    {
      "type": 32,
      "val": "u{61}",
      "flags": {
        "Char": "a",
        "Code": "61",
        "UnicodePoint": true
      },
      "typeName": "UnicodeChar"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}
/\u{61}/
"u{61}"
{
  "type": 1,
  "val": [
    {
      "type": 16,
      "val": {
        "type": 1024,
        "val": "u",
        "flags": {},
        "typeName": "String"
      },
      "flags": {
        "val": "{61}",
        "MatchMinimum": "61",
        "MatchMaximum": "61",
        "min": 61,
        "max": 61,
        "StartRepeats": 1,
        "isGreedy": 1
      },
      "typeName": "Quantifier"
    }
  ],
  "flags": {},
  "typeName": "Sequence"
}

foo123 avatar Nov 19 '22 19:11 foo123

Something like /\p{Punctuation}/u needs to be implemented.

Only on a major update, not anytime soon

Maybe we can implement quick support that simply creates a corresponding node with the provided value (that is, without checking whether the value is really valid)? The syntax can be found in the doc. That way, developers can use the library to analyze a regex with such syntax without getting an error.
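As a rough illustration of that idea, the sketch below reads a \p{...} or \P{...} escape at a given position and returns a node without validating the property name. Everything here is an assumption: the function name, the typeName, and the flag keys are hypothetical. The node shape only loosely imitates the output pasted earlier in this thread and is not the library's actual internals. A real implementation would also apply this only when the u flag is set.

```javascript
// Hypothetical sketch: none of these names come from RegexAnalyzer.
function readPropertyEscape(source, index) {
  // Accept \p{Name} or \p{Name=Value}; the names themselves are not validated.
  var match = /^\\([pP])\{([^{}=]+)(?:=([^{}=]+))?\}/.exec(source.slice(index));
  if (!match) {
    return null;
  }
  return {
    typeName: 'UnicodeProperty',     // assumed name for the new node type
    val: match[0].slice(1),          // e.g. "p{Punctuation}", like "u{61}" above
    flags: {
      Negated: match[1] === 'P',     // \P{...} is the negated form
      Name: match[2],
      Value: match[3] === undefined ? null : match[3]
    },
    length: match[0].length          // characters consumed from the source
  };
}

var node = readPropertyEscape('\\p{Punctuation}', 0);
console.log(node.val);        // "p{Punctuation}"
console.log(node.flags.Name); // "Punctuation"
```

Returning the consumed length lets a caller advance its position past the escape, whatever parsing loop it uses.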

danny0838 avatar Nov 19 '22 19:11 danny0838