
WIP/proof of concept: Unmangle MAST strings

Open samcv opened this issue 5 years ago • 3 comments

This is done in multiple steps:

  1. Handle resolving mangled lexical strings (so latin1 encoded as utf8 and utf8 encoded as latin1)
  2. Regenerate stage0 bootstrap
  3. Don't encode lexicals as latin-1 anymore, only encode to utf8
  4. Regenerate stage0 bootstrap
  5. Disable handling of mangled lexical strings

This is incomplete, but is a proof of concept for fixing us having both utf8 and latin-1 encoded lexicals.
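To make step 1 concrete, here is a minimal Python sketch of what "resolving mangled strings" could look like. It is not the code in this PR: the helper names `unmangle_utf8_read_as_latin1` and `decode_lexical_name` are hypothetical, and the try-UTF-8-first fallback is an assumption about one reasonable way to tolerate both encodings during the transition.

```python
def unmangle_utf8_read_as_latin1(s: str) -> str:
    """Recover text whose UTF-8 bytes were mistakenly decoded as Latin-1."""
    try:
        # Undo the wrong decode, then decode the same bytes as UTF-8.
        return s.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has code points above U+00FF, or its Latin-1
        # bytes are not valid UTF-8: assume it was never mangled.
        return s


def decode_lexical_name(raw: bytes) -> str:
    """Decode bytes that may be UTF-8 or Latin-1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail.
        return raw.decode("latin-1")


mangled = "$¢".encode("utf-8").decode("latin-1")  # UTF-8 bytes read as Latin-1
print(repr(mangled), "->", repr(unmangle_utf8_read_as_latin1(mangled)))
print(repr(decode_lexical_name(b"\x24\xa2")))
```

Note that this kind of recovery is a heuristic: a string that genuinely contains `Â¢` is indistinguishable from a mangled `¢`, which is presumably why the handling is meant to be temporary and switched off again in step 5.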

samcv avatar Mar 29 '20 10:03 samcv

I'm missing the overall goal of this PR? Bytecode files store strings in a couple of different ways, in order to reduce memory use and startup decoding time (because if you know it's just in the latin-1 range, a whole load of things simply cannot happen).

jnthn avatar Mar 29 '20 13:03 jnthn

> is a proof of concept for fixing us having both utf8 and latin-1 encoded lexicals

I don't understand why there's anything to fix. To me this looks like it's removing an optimization.

jnthn avatar Mar 29 '20 13:03 jnthn

@jnthn if there is a good reason for it, then it can stay. But let me try and explain in more detail what I currently understand. Some of this may be incorrect, so feel free to correct/expand on this.

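| String | Latin-1 | UTF-8 | Latin-1 and UTF-8 roundtrip identically? |
|--------|---------|-------|------------------------------------------|
| $¢ | 24 A2 | 24 C2 A2 | No |
| $foo | 24 66 6F 6F | 24 66 6F 6F | Yes |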

It is not clear to me why we should be storing strings that are not valid utf8. The one of concern inside nqp is '$¢'. I am guessing rakudo also goes through this path.

It is my opinion we should only be storing/decoding strings as utf8. This would mean $foo could still use the latin-1 encoder/decoder, since for that string it gives the same result as the utf8 encoder/decoder. But because '$¢' is encoded as 0x24, 0xA2 in latin-1 but as 0x24, 0xC2, 0xA2 in utf-8, we currently end up storing two incompatible encodings in the same blob.
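The byte sequences quoted above can be checked with a few lines of Python. This is only an illustration of the two encodings, not of how the bytecode writer works, and `encodings` is a hypothetical helper name.

```python
def encodings(name):
    """Return the (Latin-1, UTF-8) byte encodings of a string."""
    return name.encode("latin-1"), name.encode("utf-8")


for name in ("$foo", "$¢"):
    latin1, utf8 = encodings(name)
    same = "identical" if latin1 == utf8 else "different"
    print(name, "|", latin1.hex(" ").upper(), "|", utf8.hex(" ").upper(), "|", same)

# The Latin-1 bytes of '$¢' are not valid UTF-8 at all: 0xA2 is a
# continuation byte with no lead byte in front of it.
try:
    b"\x24\xa2".decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

Only strings with code points in the U+0080 to U+00FF range differ between the two encodings. Pure ASCII strings such as `$foo` come out byte-for-byte the same either way.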

If the reason for this optimization is to avoid using the full utf-8 encoder/decoder, I think it would make sense to change this pull request so we use the ASCII decoder/encoder on ASCII strings only.
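A rough Python sketch of that alternative, under the assumption that the writer picks a codec per string: ASCII-only strings take a cheap ASCII path and everything else is encoded as UTF-8. The function names `pick_encoding`, `encode_string` and `decode_string` are made up for illustration and do not correspond to anything in nqp or MoarVM.

```python
def pick_encoding(s: str) -> str:
    """Choose the cheap codec for ASCII-only text, UTF-8 otherwise."""
    return "ascii" if s.isascii() else "utf-8"


def encode_string(s: str):
    """Return (codec name, encoded bytes) for one string heap entry."""
    enc = pick_encoding(s)
    return enc, s.encode(enc)


def decode_string(enc: str, raw: bytes) -> str:
    """Decode an entry with the codec recorded when it was written."""
    return raw.decode(enc)


for name in ("$foo", "$¢"):
    enc, raw = encode_string(name)
    # Whichever path was taken, the stored bytes are valid UTF-8,
    # because ASCII is a strict subset of UTF-8.
    assert raw.decode("utf-8") == name
    assert decode_string(enc, raw) == name
    print(name, enc, raw.hex(" ").upper())
```

The trade-off is that strings in the U+0080 to U+00FF range, which a latin-1 fast path would still cover, fall back to the full UTF-8 codec. In exchange, every stored string is valid UTF-8.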

I hope this makes my intentions here a bit clearer.

samcv avatar Mar 30 '20 06:03 samcv