Rethink string representation
(cribbed from README.md)
Unlike strings in JavaScript, Lua strings are not Unicode strings, but bytestrings (sequences of 8-bit values); likewise, implementations of Lua parse the source code as a sequence of octets. However, the input to this parser is a JavaScript string, i.e. a sequence of 16-bit code units (not necessarily well-formed UTF-16). This poses a problem of how those code units should be interpreted, particularly if they are outside the Basic Latin block ('ASCII').
Currently, this parser handles Unicode input by encoding it in WTF-8, and reinterpreting the resulting code units as Unicode code points. This applies to string literals and (if extendedIdentifiers is enabled) to identifiers as well. Lua byte escapes inside string literals are interpreted directly as code points, while Lua 5.3 \u{} escapes are similarly decoded as UTF-8 code units reinterpreted as code points. It is as if the parser input was being interpreted as ISO-8859-1, while actually being encoded in UTF-8.
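The scheme described above can be sketched in a few lines of plain JS. This is a simplified illustration, not luaparse's actual code, and `toByteString` is a hypothetical name: it UTF-8-encodes each code point (lone surrogates fall into the three-byte branch, which is exactly the WTF-8 generalisation) and stores each resulting byte as one code unit.

```javascript
// Encode a JS string as (W)UTF-8, reinterpreting each byte as a code
// point -- i.e. produce the string that results from reading the UTF-8
// bytes as if they were ISO-8859-1.
function toByteString(s) {
  let out = '';
  for (const ch of s) {             // iterates by code point
    const cp = ch.codePointAt(0);
    if (cp < 0x80) {
      out += String.fromCharCode(cp);
    } else if (cp < 0x800) {
      out += String.fromCharCode(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
      // lone surrogates also land here, matching WTF-8
      out += String.fromCharCode(0xe0 | (cp >> 12),
                                 0x80 | ((cp >> 6) & 0x3f),
                                 0x80 | (cp & 0x3f));
    } else {
      out += String.fromCharCode(0xf0 | (cp >> 18),
                                 0x80 | ((cp >> 12) & 0x3f),
                                 0x80 | ((cp >> 6) & 0x3f),
                                 0x80 | (cp & 0x3f));
    }
  }
  return out;
}

console.log(toByteString('💩') === '\u00f0\u009f\u0092\u00a9'); // true
```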
This ensures that no otherwise-valid input will be rejected due to encoding errors. Assuming the input was originally encoded in UTF-8 (which includes the case of only containing ASCII characters), it also preserves the following properties:
- String literals (and identifiers, if `extendedIdentifiers` is enabled) will have the same representation in the AST if and only if they represent the same string in the source code: e.g. the Lua literals `'💩'`, `'\u{1f4a9}'` and `'\240\159\146\169'` will all have `"\u00f0\u009f\u0092\u00a9"` in their `.value` property, and likewise `local 💩` will have the same string in its `.name` property;
- The `String.prototype.charCodeAt` method in JS can be directly used to emulate Lua's `string.byte` (with one argument, after shifting offsets by 1), and likewise `String.prototype.substr` can be used similarly to Lua's `string.sub`;
- The `.length` property of decoded string values in the AST is equal to the value that the `#` operator would return in Lua.
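For concreteness, these properties can be checked on the decoded value of the literals above, using plain JS and no luaparse at all:

```javascript
// The representation luaparse stores for '💩' / '\240\159\146\169':
const value = '\u00f0\u009f\u0092\u00a9';

// Lua: string.byte(s, 1) == 240 -- in JS, with the offset shifted by 1:
console.log(value.charCodeAt(0)); // 240

// Lua: #s == 4 -- in JS:
console.log(value.length); // 4

// Lua: string.sub(s, 2, 3) -- in JS (substr takes start index and length):
console.log(value.substr(1, 2) === '\u009f\u0092'); // true
```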
Maintaining those properties makes the logic of static analysers and code transformation tools simpler. However, it poses a problem when displaying strings to the user and when serialising the AST back into a string; to recover the original bytestrings, values transformed in this way have to be encoded in ISO-8859-1.
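In Node.js, for instance, the original bytes can be recovered with the `Buffer` API's `'latin1'` encoding, which takes the low byte of each code unit:

```javascript
// Every code unit in a decoded value is < 0x100 by construction, so
// encoding as ISO-8859-1 ('latin1' in Node) recovers the original bytes.
const bytes = Buffer.from('\u00f0\u009f\u0092\u00a9', 'latin1');
console.log(bytes.toString('hex')); // 'f09f92a9' -- the UTF-8 bytes of U+1F4A9
```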
Other solutions to this problem may be considered in the future. Some of them have been listed below, with their drawbacks:
- A mode that instead treats the input as if it were decoded according to ISO-8859-1 (or the `x-user-defined` encoding) and rejects code points that cannot appear in that encoding; may be useful for source code in encodings other than UTF-8
  - Still tricky to get the semantics right
  - `x-user-defined` cannot take advantage of the compact representation of ISO-8859-1 strings in certain JavaScript engines
- Using an `ArrayBuffer` or `Uint8Array` for source code and/or string literals
  - May fail to be portable to older JavaScript engines
  - Cannot be (directly) serialised as JSON
  - Values of those types are fixed-length, which makes manipulation cumbersome; they cannot be incrementally built by appending
  - They cannot be used as keys in objects; one has to use `Map` and `WeakMap` instead
- Using a plain `Array` of numbers in the range [0, 256)
  - May be memory-inefficient in naïve JavaScript engines
  - May bloat the JSON serialisation considerably
  - Cannot be used as keys in objects either
- Storing string literal values as ordinary `String` values, and requiring that escape sequences in literals constitute well-formed UTF-8; an exception is thrown if they do not
  - UTF-8 chauvinism; imposes semantics that may be unwanted
  - Reduced compatibility with other Lua implementations
- Like the above, but instead of throwing an exception, ill-formed escapes are transformed into unpaired surrogates, just like Python's `surrogateescape` encoding error handler
  - UTF-8 chauvinism, though to a lesser extent
  - Destroys the property that `("\xc4" .. "\x99") == "\xc4\x99"`
  - If the AST is encoded in JSON, some JSON libraries may refuse to parse it
Cf. discussion under c05822dd3b88103b998a5417fb6fa7f1757f86b8.
I will probably add a switch to toggle between these modes:
- no interpretation for string literals at all; extended identifiers not mangled
- pseudo-ISO-8859-1/`x-user-defined` (option 0)
- UTF-8 (either current behaviour or option 3/4)
Got some WIP code that implements an `encodingMode` option, allowing one to switch between:

- current behaviour
- no mangling for identifiers, `.value` of string literal nodes is `null`
- ISO-8859-1
- `x-user-defined`
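For reference, `x-user-defined` (as defined in the WHATWG Encoding Standard) maps bytes 0x80–0xFF to the Private Use Area range U+F780–U+F7FF, so every byte value round-trips losslessly; a minimal sketch of the byte-level mapping (helper names are made up here):

```javascript
// x-user-defined decoder step: ASCII bytes map to themselves,
// bytes 0x80..0xFF map to U+F780..U+F7FF.
function xUserDefinedDecodeByte(b) {
  return String.fromCharCode(b < 0x80 ? b : 0xf700 + b);
}

// The inverse: anything outside the two ranges is unrepresentable.
function xUserDefinedEncodeChar(c) {
  const cu = c.charCodeAt(0);
  if (cu < 0x80) return cu;
  if (cu >= 0xf780 && cu <= 0xf7ff) return cu - 0xf700;
  throw new RangeError('not representable in x-user-defined');
}

console.log(xUserDefinedEncodeChar(xUserDefinedDecodeByte(0xa9))); // 169
```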
A 'true UTF-8' mode (option 3 or 4) would be considerably more involved, and perhaps not worth it. Still considering it, though.
I changed my mind; I won't keep current behaviour as the default, maybe I won't even keep it as an option; nobody seems to expect or desire it anyway. The default will be no mangling and no literal interpretation; this allows users to parse Unicode source code without hassle, while those interested in string literals can choose some other mode that ensures a coherent interpretation.
I implemented UTF-8 modes too, but they're a little hacky. I also still need to document the option.
Finally committed as https://github.com/fstirlitz/luaparse/compare/fstirlitz:2b04739...fstirlitz:10666c7.
Leaving out UTF-8 modes for the moment; I may add them later. I’m leaving this issue open until I make a decision, but either way it goes, it’s not a release blocker.
I'd still be interested in UTF-8. I've tried reading up on `x-user-defined` but did not come away with an understanding of where it would break down. I am interested in literal strings, as I want to use luaparse to turn Lua source into JS.