nqp
WIP/proof of concept: Unmangle MAST strings
This is done in multiple steps:
1. Handle resolving mangled lexical strings (so latin1 encoded as utf8 and utf8 encoded as latin1)
2. Regenerate stage0 bootstrap
3. Don't encode lexicals as latin-1 anymore, only encode to utf8
4. Regenerate stage0 bootstrap
5. Disable handling of mangled lexical strings

This is incomplete, but is a proof of concept for fixing us having both utf8 and latin-1 encoded lexicals.
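To make step 1 concrete, here is a minimal Python sketch of what "resolving a mangled string" could look like. It is not the code in this PR; the function name `unmangle` is hypothetical, and it only covers one direction of mangling (UTF-8 bytes that were decoded as Latin-1).

```python
def unmangle(text):
    """Repair a string whose UTF-8 bytes were mistakenly decoded as Latin-1.

    Hypothetical helper for illustration only; not part of nqp or MoarVM.
    """
    try:
        # Recover the original bytes, then decode them as they were meant to be.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mangled in this way (or not recoverable): leave it untouched.
        return text


# Simulate the mangling: UTF-8 bytes 24 C2 A2 read back as Latin-1.
mangled = "$¢".encode("utf-8").decode("latin-1")

print(unmangle(mangled) == "$¢")   # the mangled name is repaired
print(unmangle("$foo") == "$foo")  # ASCII names pass through unchanged
print(unmangle("$¢") == "$¢")      # an intact name is left alone
```

This kind of repair is a heuristic: a string that genuinely contains the characters `Â¢` would be "repaired" incorrectly, which is one reason such handling would only be kept for the bootstrap transition and disabled again in step 5.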
I'm missing the overall goal of this PR. Bytecode files store strings in a couple of different ways, in order to reduce memory use and startup decoding time (because if you know a string is entirely in the latin-1 range, a whole load of things simply cannot happen).
> is a proof of concept for fixing us having both utf8 and latin-1 encoded lexicals
I don't understand why there's anything to fix. To me this looks like it's removing an optimization.
@jnthn if there is a good reason for it, then it can stay. But let me try and explain in more detail what I currently understand. Some of this may be incorrect, so feel free to correct/expand on this.
| String | Latin-1 | UTF-8 | Latin-1 and UTF-8 roundtrip identically? |
|---|---|---|---|
| $¢ | 24 A2 | 24 C2 A2 | No |
| $foo | 24 66 6F 6F | 24 66 6F 6F | Yes |
It is not clear to me why we should be storing strings that are not valid utf8. The one of concern inside nqp is '$¢'. I am guessing rakudo also goes through this path.
In my opinion we should only be storing and decoding strings as utf8. $foo could still use the latin-1 encoder/decoder, since for that string it gives the same result as the utf8 encoder/decoder. But '$¢' is encoded as 0x24, 0xA2 in latin-1 and as 0x24, 0xC2, 0xA2 in utf-8, so we end up storing two incompatible encodings in the same blob.
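The byte values in the table can be checked with a short Python snippet. This is only an illustration of the two encodings using Python's standard codecs; it says nothing about how MoarVM itself reads or writes the string heap.

```python
def hex_bytes(data):
    """Format bytes as space-separated uppercase hex, e.g. '24 A2'."""
    return " ".join(f"{b:02X}" for b in data)


for name in ("$¢", "$foo"):
    latin1 = name.encode("latin-1")
    utf8 = name.encode("utf-8")
    print(name, "|", hex_bytes(latin1), "|", hex_bytes(utf8), "|", latin1 == utf8)

# Reading the Latin-1 bytes of '$¢' with a UTF-8 decoder fails outright,
# because 0xA2 on its own is not a valid UTF-8 sequence.
try:
    "$¢".encode("latin-1").decode("utf-8")
    latin1_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_as_utf8_ok = False

# Reading the UTF-8 bytes with a Latin-1 decoder "succeeds" but yields a
# different, three-character string.
utf8_as_latin1 = "$¢".encode("utf-8").decode("latin-1")
```

So a reader that guesses the wrong encoding either errors out or silently produces a different name, which is the incompatibility described above.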
If the reason for this optimization is to avoid using the full utf-8 encoder/decoder, I think it would make sense to change this pull request so we use the ASCII decoder/encoder on ASCII strings only.
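As a rough sketch of that alternative, in Python: pick the cheap path only when the string is pure ASCII, and fall back to UTF-8 otherwise. The function name `encode_for_heap` and the `"ascii"`/`"utf8"` tags are made up for this example and are not MoarVM's actual API or on-disk format.

```python
def encode_for_heap(s):
    """Return (encoding_tag, bytes) for a string to be stored.

    Hypothetical helper: ASCII-only strings take the fast path, everything
    else is stored as UTF-8, so Latin-1 is never written.
    """
    if s.isascii():
        return ("ascii", s.encode("ascii"))
    return ("utf8", s.encode("utf-8"))


def decode_from_heap(tag, data):
    """Inverse of encode_for_heap, under the same assumptions."""
    if tag == "ascii":
        return data.decode("ascii")
    return data.decode("utf-8")


for name in ("$foo", "$¢"):
    tag, data = encode_for_heap(name)
    print(name, tag, decode_from_heap(tag, data) == name)
```

Because every ASCII byte sequence is also valid UTF-8 with the same meaning, a reader that ignores the tag and always decodes as UTF-8 still gets the right string, so the blob only ever contains one encoding.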
I hope this makes my intentions here a bit clearer.