Rethink string representation
(cribbed from README.md)
Unlike strings in JavaScript, Lua strings are not Unicode strings, but bytestrings (sequences of 8-bit values); likewise, implementations of Lua parse the source code as a sequence of octets. However, the input to this parser is a JavaScript string, i.e. a sequence of 16-bit code units (not necessarily well-formed UTF-16). This poses a problem of how those code units should be interpreted, particularly if they are outside the Basic Latin block ('ASCII').
Currently, this parser handles Unicode input by encoding it in WTF-8, and reinterpreting the resulting code units as Unicode code points. This applies to string literals and (if extendedIdentifiers is enabled) to identifiers as well. Lua byte escapes inside string literals are interpreted directly as code points, while Lua 5.3 \u{} escapes are similarly decoded as UTF-8 code units reinterpreted as code points. It is as if the parser input was being interpreted as ISO-8859-1, while actually being encoded in UTF-8.
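The scheme described above can be sketched in a few lines of plain JS. This is a simplified illustration, not luaparse's actual code, and `toByteString` is a hypothetical name: it UTF-8-encodes each code point (lone surrogates fall into the three-byte branch, which is exactly the WTF-8 generalisation) and stores each resulting byte as one code unit.

```javascript
// Encode a JS string as (W)UTF-8, reinterpreting each byte as a code
// point -- i.e. produce the string that results from reading the UTF-8
// bytes as if they were ISO-8859-1.
function toByteString(s) {
  let out = '';
  for (const ch of s) {             // iterates by code point
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      out += String.fromCharCode(cp);
    } else if (cp < 0x800) {
      out += String.fromCharCode(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      // lone surrogates also land here, matching WTF-8
      out += String.fromCharCode(0xe0 | (cp >> 12),
                                 0x80 | ((cp >> 6) & 0x3f),
                                 0x80 | (cp & 0x3f));
    } else {
      out += String.fromCharCode(0xf0 | (cp >> 18),
                                 0x80 | ((cp >> 12) & 0x3f),
                                 0x80 | ((cp >> 6) & 0x3f),
                                 0x80 | (cp & 0x3f));
    }
  }
  return out;
}

console.log(toByteString('💩') === '\u00f0\u009f\u0092\u00a9'); // true
```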
This ensures that no otherwise-valid input will be rejected due to encoding errors. Assuming the input was originally encoded in UTF-8 (which includes the case of only containing ASCII characters), it also preserves the following properties:
- String literals (and identifiers, if `extendedIdentifiers` is enabled) will have the same representation in the AST if and only if they represent the same string in the source code: e.g. the Lua literals `'💩'`, `'\u{1f4a9}'` and `'\240\159\146\169'` will all have `"\u00f0\u009f\u0092\u00a9"` in their `.value` property, and likewise `local 💩` will have the same string in its `.name` property;
- The `String.prototype.charCodeAt` method in JS can be directly used to emulate Lua's `string.byte` (with one argument, after shifting offsets by 1), and likewise `String.prototype.substr` can be used similarly to Lua's `string.sub`;
- The `.length` property of decoded string values in the AST is equal to the value that the `#` operator would return in Lua.
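For concreteness, these properties can be checked on the decoded value of the literals above, using plain JS and no luaparse at all:

```javascript
// The representation luaparse stores for '💩' / '\240\159\146\169':
const value = '\u00f0\u009f\u0092\u00a9';

// Lua: string.byte(s, 1) == 240 -- in JS, with the offset shifted by 1:
console.log(value.charCodeAt(0)); // 240

// Lua: #s == 4 -- in JS:
console.log(value.length); // 4

// Lua: string.sub(s, 2, 3) -- in JS (substr takes start index and length):
console.log(value.substr(1, 2) === '\u009f\u0092'); // true
```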
Maintaining those properties makes the logic of static analysers and code transformation tools simpler. However, it poses a problem when displaying strings to the user and when serialising the AST back into a string; to recover the original bytestrings, values transformed in this way have to be encoded in ISO-8859-1.
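In Node.js, for instance, the original bytes can be recovered with the `Buffer` API's `'latin1'` encoding, which takes the low byte of each code unit:

```javascript
// Every code unit in a decoded value is < 0x100 by construction, so
// encoding as ISO-8859-1 ('latin1' in Node) recovers the original bytes.
const bytes = Buffer.from('\u00f0\u009f\u0092\u00a9', 'latin1');
console.log(bytes.toString('hex')); // 'f09f92a9' -- the UTF-8 bytes of U+1F4A9
```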
Other solutions to this problem may be considered in the future. Some of them have been listed below, with their drawbacks:
- A mode that instead treats the input as if it were decoded according to ISO-8859-1 (or the `x-user-defined` encoding) and rejects code points that cannot appear in that encoding; may be useful for source code in encodings other than UTF-8
  - Still tricky to get the semantics right
  - `x-user-defined` cannot take advantage of the compact representation of ISO-8859-1 strings in certain JavaScript engines
- Using an `ArrayBuffer` or `Uint8Array` for source code and/or string literals
  - May fail to be portable to older JavaScript engines
  - Cannot be (directly) serialised as JSON
  - Values of those types are fixed-length, which makes manipulation cumbersome; they cannot be incrementally built by appending
  - They cannot be used as keys in objects; one has to use `Map` and `WeakMap` instead
- Using a plain `Array` of numbers in the range [0, 256)
  - May be memory-inefficient in naïve JavaScript engines
  - May bloat the JSON serialisation considerably
  - Cannot be used as keys in objects either
- Storing string literal values as ordinary `String` values, and requiring that escape sequences in literals constitute well-formed UTF-8; an exception is thrown if they do not
  - UTF-8 chauvinism; imposes semantics that may be unwanted
  - Reduced compatibility with other Lua implementations
- Like the above, but instead of throwing an exception, ill-formed escapes are transformed into unpaired surrogates, just like Python's `surrogateescape` encoding error handler
  - UTF-8 chauvinism, though to a lesser extent
  - Destroys the property that `("\xc4" .. "\x99") == "\xc4\x99"`
  - If the AST is encoded in JSON, some JSON libraries may refuse to parse it
Cf. discussion under c05822dd3b88103b998a5417fb6fa7f1757f86b8.
I will probably add a switch to toggle between these modes:
- no interpretation for string literals at all; extended identifiers not mangled
- pseudo-ISO-8859-1/`x-user-defined` (option 0)
- UTF-8 (either current behaviour or option 3/4)
Got some WIP code that implements an `encodingMode` option, allowing one to switch between:

- current behaviour
- no mangling for identifiers, `.value` of string literal nodes is `null`
- ISO-8859-1
- `x-user-defined`
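For reference, `x-user-defined` (as defined in the WHATWG Encoding Standard) maps bytes 0x80–0xFF to the Private Use Area range U+F780–U+F7FF, so every byte value round-trips losslessly; a minimal sketch of the byte-level mapping (helper names are made up here):

```javascript
// x-user-defined decoder step: ASCII bytes map to themselves,
// bytes 0x80..0xFF map to U+F780..U+F7FF.
function xUserDefinedDecodeByte(b) {
  return String.fromCharCode(b < 0x80 ? b : 0xf700 + b);
}

// The inverse: anything outside the two ranges is unrepresentable.
function xUserDefinedEncodeChar(c) {
  const cu = c.charCodeAt(0);
  if (cu < 0x80) return cu;
  if (cu >= 0xf780 && cu <= 0xf7ff) return cu - 0xf700;
  throw new RangeError('not representable in x-user-defined');
}

console.log(xUserDefinedEncodeChar(xUserDefinedDecodeByte(0xa9))); // 169
```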
A 'true UTF-8' mode (option 3 or 4) would be considerably more involved, and perhaps not worth it. Still considering it, though.
I changed my mind; I won't keep current behaviour as the default, maybe I won't even keep it as an option; nobody seems to expect or desire it anyway. The default will be no mangling and no literal interpretation; this allows users to parse Unicode source code without hassle, while those interested in string literals can choose some other mode that ensures a coherent interpretation.
I implemented UTF-8 modes too, but they're a little hacky. I also still need to document the option.
Finally committed as https://github.com/fstirlitz/luaparse/compare/fstirlitz:2b04739...fstirlitz:10666c7.
Leaving out UTF-8 modes for the moment; I may add them later. I’m leaving this issue open until I make a decision, but either way it goes, it’s not a release blocker.
I'd still be interested in UTF-8. I've tried reading up on `x-user-defined` but did not come away with an understanding of where it would break down. I am interested in literal strings, as I want to use luaparse to turn Lua source into JS.