stringref
Clarity on units of string
I'll start by saying that I'm not well-versed in WebAssembly specifications or proposals, but I happened to come across this proposal, and I'm quite interested in Unicode strings and how they're represented in different programming languages and VMs.
This might be just a wording issue, but I think the current description of "What's a string?" could be problematic:
> Therefore we define a string to be a sequence of unicode scalar values and isolated surrogates. The code units of a Java or JavaScript string can be interpreted to encode such a sequence, in the WTF-16 encoding form.
If this is taken to mean that a string is a sequence of Unicode code points (i.e. "Unicode scalar values" and "isolated surrogates", basically any integer from `0x0` to `0x10FFFF`), this does not correspond with "WTF-16" or JavaScript strings, since there are sequences of isolated surrogates that can't be distinctly encoded in WTF-16. E.g. the sequence `[U+D83D, U+DCA9]` (high surrogate, low surrogate) doesn't have an encoding form in WTF-16: interpreting the WTF-16/UTF-16 code unit sequence `<D83D DCA9>` produces the sequence `[U+1F4A9]` (a Unicode scalar value, 💩). There are 1048576 such sequences (one for every code point outside the BMP), where the "obvious" encoding is already used to encode a USV.
```
> "\uD83D\uDCA9"
'💩'
> "\u{1F4A9}"
'💩'
> [..."\uD83D\uDCA9"].length
1
> [..."\u{1F4A9}"].length
1
```
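For what it's worth, the lossiness can be sketched with a toy WTF-16 encoder/decoder (a sketch of my own, not taken from any spec or library):

```python
# Toy WTF-16 encoder/decoder (illustrative sketch, not from any library),
# showing that the code point sequence [U+D83D, U+DCA9] does not survive
# a round trip: its "obvious" encoding decodes to [U+1F4A9].

def wtf16_encode(code_points):
    units = []
    for cp in code_points:
        if cp < 0x10000:
            # BMP code points, including lone surrogates, map to one unit
            units.append(cp)
        else:
            # supplementary code points split into a surrogate pair
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
    return units

def wtf16_decode(units):
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # a high surrogate unit followed by a low surrogate unit is
            # always decoded as one supplementary code point
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

assert wtf16_encode([0x1F4A9]) == [0xD83D, 0xDCA9]
# The surrogate pair of *code points* comes back as U+1F4A9:
assert wtf16_decode(wtf16_encode([0xD83D, 0xDCA9])) == [0x1F4A9]
```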
Note that this is different to how strings work in Python 3, where a string is indeed a sequence of any Unicode code points:
```
>>> "\U0000D83D\U0000DCA9"
'\ud83d\udca9'
>>> "\U0001F4A9"
'💩'
>>> len("\U0000D83D\U0000DCA9")
2
>>> len("\U0001F4A9")
1
```
If this proposal is suggesting that strings work the same way as in Python 3, I think implementations will likely[0] resort to using UTF-32 in some cases, as I believe Python implementations do (CPython essentially uses UTF-32[1] for all strings, though it switches between 8-bit, 16-bit and 32-bit arrays depending on the range of code points used). Other than Python, I'm not actually sure which language implementations would benefit from such a string representation.
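The flexible storage can be observed directly on CPython (this is an implementation detail of CPython's PEP 393 representation, not a language guarantee, so the exact sizes may vary):

```python
import sys

# CPython (PEP 393) picks 1-, 2- or 4-byte units per string based on the
# widest code point the string contains.
n = 1000
latin  = "a" * n              # all code points fit in 1 byte
bmp    = "\u0394" * n         # needs 2 bytes per code point
astral = "\U0001F4A9" * n     # needs 4 bytes per code point

for s in (latin, bmp, astral):
    print(len(s), sys.getsizeof(s))

# Each step up roughly doubles the storage per character.
assert sys.getsizeof(latin) < sys.getsizeof(bmp) < sys.getsizeof(astral)
```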
As a side note, the section in question also lumps together Python (presumably 3) and Rust, though this might result from the misunderstanding addressed above. Rust strings are meant to be[2] valid UTF-8, and hence correspond to sequences of Unicode scalar values, whereas, as explained above, Python strings can be any distinct sequence of Unicode code points.
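To make that difference concrete: a lone surrogate is a valid Python 3 string, but it is not a Unicode scalar value and so has no UTF-8 encoding, which is the representation Rust's strings require:

```python
# A lone surrogate is a perfectly valid Python 3 string...
s = "\ud83d"
print(len(s))  # 1

# ...but it has no UTF-8 encoding, so there is no equivalent Rust string.
try:
    s.encode("utf-8")
except UnicodeEncodeError as e:
    print(e)  # surrogates not allowed
```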
[0] Another alternative would be using a variation of WTF-8 that preserves UTF-16 surrogates instead of normalising them to USVs on concatenation, though this seems a bit crazy.
[1] Technically this is an extension of UTF-32, since UTF-32 itself doesn't allow encoding of code points in the surrogate range.
[2] This is at least true at some API level, though technically the representation of strings in Rust is allowed to contain arbitrary bytes, where it is up to libraries to avoid emitting invalid UTF-8 to safe code: https://github.com/rust-lang/rust/issues/71033