regexp-tree
regexp-tree copied to clipboard
Unicode in input string is not handled
/👍/u
parses differently to
/\u{1f44d}/u
The first is becoming 2 chars \ud83d
and \udc4d
.
I might try and detect any unicode in the input string and error out if that's the case, but wondering if this lib can handle both the above the same, or maybe error?
Trying to get an understanding of how the string encoding works but I think a
/[\uD800-\uDBFF\uDC00-\uDFFF]/
check should catch high/low surrogate pair code points, which are what we'd want to throw an error for if the u
flag is present and it's not possible to switch them to the \u{x}
representations.
These ranges are from https://mathiasbynens.be/notes/javascript-encoding#surrogate-pairs
Wondering if there is a spec somewhere that says how to handle unicode characters actually.
I.e. that says /👍/u
should be translated to /\u{1f44d}/u
. It does seem to be what Chrome does.
e.g.
/^𝌆+$/.test('𝌆') === true
/^𝌆+$/.test('𝌆𝌆') === false
I think because 𝌆
was converted to \ud834\udf06+
/^𝌆+$/.test('𝌆\udf06') === true
/^𝌆+$/u.test('𝌆𝌆') === true
Interestingly
/^\ud834\udf06+$/u.test('𝌆𝌆') === true
So looks like \ud834\udf06
gets converted to 𝌆
before the +
is taken into consideration. regexp-tree
actually handles this properly.
I think a fix for this could be changing the generated code to be
[/^[^*?+\[()\\|]/u, function() { return 'CHAR' }],
instead of
[/^[^*?+\[()\\|]/, function() { return 'CHAR' }],
when the u
flag is enabled on the input. Not sure how to do that yet
This causes _match
to return a surrogate pair as is instead of slicing it in two, and the cursor seems to be updated correctly because this._cursor += match.length
handles it being 2.
With this change
/👍/u
becomes
{
type: 'RegExp',
body: {
type: 'Char',
value: '👍',
kind: 'simple',
symbol: '👍',
codePoint: 128077
},
flags: 'u'
}
which is still different to /\u{1f44d}/u
:
{
type: 'RegExp',
body: {
type: 'Char',
value: '\\u{1f44d}',
kind: 'unicode',
symbol: '👍',
codePoint: 128077
},
flags: 'u'
}
but I think value
and kind
being different is ok because one of them is actually escaped, so they are a bit different, but importantly the codePoint
is now the same and it remains a single Char
.
Also a bit unrelated I think we should use charCodeAt
instead of codePointAt
when the u
flag is not enabled, so that users don't need to polyfill codePointAt
even if they're not using the u
flag.
WDYT?
@tjenkinson thanks for the report and investigation, I think the change looks reasonable. @mathiasbynens, what are your thoughts on this?
Also, yes, when u
is not enabled, the charCodeAt
might be a good alternative.
The u
flag indeed acts as an opt-in to using code points as the character boundary, instead of UCS-2/UTF-16 code units (without the u
flag). I wrote about that here along with some examples: https://mathiasbynens.be/notes/es6-unicode-regex