regexp-tree icon indicating copy to clipboard operation
regexp-tree copied to clipboard

Unicode in input string is not handled

Open tjenkinson opened this issue 3 years ago • 5 comments

/👍/u

parses differently to

/\u{1f44d}/u

The first is becoming 2 chars \ud83d and \udc4d.

I might try and detect any unicode in the input string and error out if that's the case, but wondering if this lib can handle both the above the same, or maybe error?

tjenkinson avatar Apr 13 '21 09:04 tjenkinson

Trying to get an understanding of how the string encoding works but I think a

/[\uD800-\uDBFF\uDC00-\uDFFF]/

check should catch high/low surrogate pair code points, which are what we'd want to throw an error for if the u flag is present and it's not possible to switch them to the \u{x} representations.

These ranges are from https://mathiasbynens.be/notes/javascript-encoding#surrogate-pairs

tjenkinson avatar Apr 13 '21 14:04 tjenkinson

Wondering if there is a spec somewhere that says how to handle unicode characters actually.

I.e. that says /👍/u should be translated to /\u{1f44d}/u. It does seem to be what Chrome does.

e.g.

/^𝌆+$/.test('𝌆') === true
/^𝌆+$/.test('𝌆𝌆') === false

I think because 𝌆 was converted to \ud834\udf06+

/^𝌆+$/.test('𝌆\udf06') === true
/^𝌆+$/u.test('𝌆𝌆') === true

Interestingly

/^\ud834\udf06+$/u.test('𝌆𝌆') === true

So looks like \ud834\udf06 gets converted to 𝌆 before the + is taken into consideration. regexp-tree actually handles this properly.

tjenkinson avatar Apr 13 '21 15:04 tjenkinson

I think a fix for this could be changing the generated code to be

[/^[^*?+\[()\\|]/u, function() { return 'CHAR' }],

instead of

[/^[^*?+\[()\\|]/, function() { return 'CHAR' }],

when the u flag is enabled on the input. Not sure how to do that yet

This causes _match to return a surrogate pair as is instead of slicing it in two, and the cursor seems to be updated correctly because this._cursor += match.length handles it being 2.

With this change

/👍/u

becomes

{
  type: 'RegExp',
  body: {
    type: 'Char',
    value: '👍',
    kind: 'simple',
    symbol: '👍',
    codePoint: 128077
  },
  flags: 'u'
}

which is still different to /\u{1f44d}/u:

{
  type: 'RegExp',
  body: {
    type: 'Char',
    value: '\\u{1f44d}',
    kind: 'unicode',
    symbol: '👍',
    codePoint: 128077
  },
  flags: 'u'
}

but I think value and kind being different is ok because one of them is actually escaped, so they are a bit different, but importantly the codePoint is now the same and it remains a single Char.

Also a bit unrelated I think we should use charCodeAt instead of codePointAt when the u flag is not enabled, so that users don't need to polyfill codePointAt even if they're not using the u flag.

WDYT?

tjenkinson avatar Apr 13 '21 19:04 tjenkinson

@tjenkinson thanks for the report and investigation, I think the change looks reasonable. @mathiasbynens, what are your thoughts on this?

Also, yes, when u is not enabled, the charCodeAt might be a good alternative.

DmitrySoshnikov avatar Apr 14 '21 06:04 DmitrySoshnikov

The u flag indeed acts as an opt-in to using code points as the character boundary, instead of UCS-2/UTF-16 code units (without the u flag). I wrote about that here along with some examples: https://mathiasbynens.be/notes/es6-unicode-regex

mathiasbynens avatar Apr 15 '21 15:04 mathiasbynens