
JavaScript can represent invalid Unicode strings

Open kannanvijayan-zz opened this issue 7 years ago • 2 comments

It's possible for valid JavaScript strings to be invalid Unicode strings, arising from the fact that JS strings are specced as arbitrary sequences of 16-bit code units. This means that ill-formed UTF-16 sequences, for example \udc11 (a lone trailing surrogate), can show up in our string literals.

The BinAST encoding needs to handle this: we cannot assume that there is always a valid translation of a JS string to a UTF-8 string. The problem arises wherever 16-bit code units fall into the surrogate range without forming a valid pair.
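To make the problem concrete, here's a small illustration (just a demonstration, assuming a JS/TS environment with `TextEncoder`): a lone surrogate is a perfectly legal JS string value, but a spec-compliant UTF-8 encoder cannot represent it and substitutes U+FFFD, so the original string can't be round-tripped through UTF-8.

```ts
// A lone surrogate is a valid JS string of length 1...
const s = "\udc11";
console.log(s.length);                      // 1
console.log(s.charCodeAt(0).toString(16));  // "dc11"

// ...but a standard UTF-8 encoder replaces it with U+FFFD, so the byte
// output no longer corresponds to the source string.
const bytes = new TextEncoder().encode(s);
console.log(bytes);                         // Uint8Array [ 239, 191, 189 ]  (EF BF BD = U+FFFD)
```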

My suggestion is the following: we translate the 16-bit code unit sequence as if it were a UTF-16 string. When we see valid surrogate pairs, we combine them into Unicode code points and re-encode those as UTF-8 sequences.

When we see surrogate values in positions where they don't form a valid pair, we encode them directly as if they were code points. Surrogates are not Unicode scalar values, so no well-formed UTF-8 sequence corresponds to them. Those byte sequences are thus "free" for us to use to encode unpaired surrogates.
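For concreteness, a minimal TypeScript sketch of that scheme (the `encodeJsString` helper is hypothetical, not proposed spec text): valid surrogate pairs are combined into code points and encoded normally, while unpaired surrogates fall through to the three-byte encoder as if they were code points.

```ts
// Sketch of the suggested encoding: decode the 16-bit code units as UTF-16
// where possible, and pass unpaired surrogates through as three-byte
// sequences as if they were code points.
function encodeJsString(s: string): Uint8Array {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    let cp = s.charCodeAt(i);
    // Combine a valid surrogate pair into a single code point.
    if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        cp = 0x10000 + ((cp - 0xd800) << 10) + (next - 0xdc00);
        i++;
      }
    }
    // Unpaired surrogates reach this point with cp in [0xD800, 0xDFFF] and
    // get encoded "directly as code points" (three bytes), a byte sequence
    // that no well-formed UTF-8 encoder would ever emit.
    if (cp <= 0x7f) {
      out.push(cp);
    } else if (cp <= 0x7ff) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp <= 0xffff) {
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f),
      );
    }
  }
  return Uint8Array.from(out);
}

console.log(encodeJsString("\ud83d\ude00")); // valid pair U+1F600 -> 240, 159, 152, 128
console.log(encodeJsString("A\udc11"));      // lone surrogate   -> 65, 237, 176, 145
```

Decoding reverses this: byte sequences that decode to values in the surrogate range are emitted as bare 16-bit code units instead of being rejected.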

I'm not 100% sure this needs to be addressed in the spec itself, but @Yoric suggested I file the issue here in case it does.

kannanvijayan-zz commented Jun 14 '18

WTF-8 seems like a similar scheme

mroch commented Jun 18 '18

@mroch yeah, it's exactly that scheme :) I just didn't realize it. We can replace my whole comment with "use WTF-8 for encoding JS strings".

kannanvijayan-zz commented Jun 19 '18