rune
Unified string type
Emacs has a unique scheme for representing strings.
character code     1st byte   byte sequence
--------------     --------   -------------
     0-7F          00..7F     0xxxxxxx
    80-7FF         C2..DF     110yyyyx 10xxxxxx
   800-FFFF        E0..EF     1110yyyy 10yxxxxx 10xxxxxx
 10000-1FFFFF      F0..F7     11110yyy 10yyxxxx 10xxxxxx 10xxxxxx
200000-3FFF7F      F8         11111000 1000yxxx 10xxxxxx 10xxxxxx 10xxxxxx
3FFF80-3FFFFF      C0..C1     1100000x 10xxxxxx (for eight-bit-char)
400000-...         invalid

invalid 1st byte   80..BF     10xxxxxx
                   F9..FF     11111yyy
In each bit pattern, 'x' and 'y' each represent a single bit of the
character code payload, and at least one 'y' must be a 1 bit.
In the 5-byte sequence, the 22-bit payload cannot exceed 3FFF7F.
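The table above can be sketched as an encoder. This is a minimal illustration, not rune's actual API; the function name is mine:

```rust
/// Encode an Emacs character code into its internal byte sequence,
/// following the table above. Returns None for invalid codes.
fn encode_emacs_char(c: u32) -> Option<Vec<u8>> {
    match c {
        0x00..=0x7F => Some(vec![c as u8]),
        0x80..=0x7FF => Some(vec![
            0xC0 | (c >> 6) as u8,
            0x80 | (c & 0x3F) as u8,
        ]),
        0x800..=0xFFFF => Some(vec![
            0xE0 | (c >> 12) as u8,
            0x80 | ((c >> 6) & 0x3F) as u8,
            0x80 | (c & 0x3F) as u8,
        ]),
        0x10000..=0x1FFFFF => Some(vec![
            0xF0 | (c >> 18) as u8,
            0x80 | ((c >> 12) & 0x3F) as u8,
            0x80 | ((c >> 6) & 0x3F) as u8,
            0x80 | (c & 0x3F) as u8,
        ]),
        0x200000..=0x3FFF7F => Some(vec![
            0xF8,
            0x80 | ((c >> 18) & 0x0F) as u8,
            0x80 | ((c >> 12) & 0x3F) as u8,
            0x80 | ((c >> 6) & 0x3F) as u8,
            0x80 | (c & 0x3F) as u8,
        ]),
        // Raw eight-bit bytes use the short 2-byte C0/C1 form,
        // not the 5-byte form their magnitude would suggest.
        0x3FFF80..=0x3FFFFF => {
            let b = (c & 0xFF) as u8; // low byte is the original raw byte
            Some(vec![0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)])
        }
        _ => None, // 0x400000 and above are invalid
    }
}
```

Note how the last arm must come before the fallthrough but after the 5-byte range, mirroring the carve-out in the table.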
Raw 8-bit bytes are represented by codepoints 0x3FFF80 to 0x3FFFFF. However, in the UTF-8-like encoding, where they would otherwise get a 5-byte sequence starting with 0xF8, they are instead represented by a 2-byte sequence starting with 0xC0 or 0xC1. These 2-byte sequences are disallowed in UTF-8 because they would form a duplicate (overlong) encoding of the 1-byte ASCII range.
Raw bytes in the ASCII range (0-127) are stored as themselves. Bytes above 127 are encoded using the extended codepoints described above. These extended codepoints don't follow the normal rules: they occupy two bytes, reusing the lead bytes 0xC0 and 0xC1 that sit unused between the 1-byte and 2-byte ranges. For example, to encode 137 (#o211 #x89) as a raw byte, the codepoint is 0x3FFF89. Notice that the low byte of the codepoint is the original raw byte. It is laid out in memory like this:
- Original binary :: 1000 1001 (0x89)
- remove the eighth bit :: 000 1001
- encode using the table above :: 1100_0000 1000_1001
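The worked example above can be checked with a short sketch. The function names here are hypothetical, modeled on Emacs's `BYTE8_TO_CHAR`/`BYTE8_STRING` macros, not rune's API:

```rust
/// Map a raw byte (>= 0x80) to its extended codepoint: the raw byte
/// becomes the low byte of a codepoint in 0x3FFF80..=0x3FFFFF.
fn byte8_to_char(b: u8) -> u32 {
    debug_assert!(b >= 0x80, "ASCII bytes are stored as themselves");
    0x3FFF00 + b as u32
}

/// Encode a raw byte into its 2-byte memory layout: drop the eighth
/// bit, then fill the 1100000x 10xxxxxx pattern from the table.
fn byte8_encode(b: u8) -> [u8; 2] {
    [0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)]
}
```

For 0x89 this reproduces the steps above: codepoint 0x3FFF89, bytes 0xC0 0x89.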
This encoding scheme is clever and flexible, but it is unique to Emacs. It means you can't reuse any string-processing libraries that expect Unicode, because Emacs supports a superset of Unicode. We would like to avoid this limitation if at all possible.
Currently the plan is to have two types of string: unibyte and multibyte. Unibyte strings are raw byte arrays ([u8]) and multibyte strings are valid UTF-8 (str). There is no concept of a "raw byte" in a multibyte string; adding one will automatically convert the string to unibyte.
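A minimal sketch of this two-flavor design, with hypothetical names (rune's real type will differ):

```rust
/// The two planned string flavors.
enum LispString {
    /// Raw byte array; no encoding guarantee.
    Unibyte(Vec<u8>),
    /// Guaranteed valid UTF-8; ordinary string libraries apply.
    Multibyte(String),
}

impl LispString {
    /// Pushing a raw byte onto a multibyte string converts it to
    /// unibyte, since multibyte strings must stay valid UTF-8.
    fn push_raw_byte(self, b: u8) -> LispString {
        match self {
            LispString::Multibyte(s) => {
                let mut bytes = s.into_bytes();
                bytes.push(b);
                LispString::Unibyte(bytes)
            }
            LispString::Unibyte(mut v) => {
                v.push(b);
                LispString::Unibyte(v)
            }
        }
    }
}
```

The key design choice is that the conversion is one-way and automatic: multibyte never silently holds invalid UTF-8.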
So far the only places I have seen unibyte strings used are in the byte compiler (for opcodes) and when viewing non-text files (like a binary). Given this, I think we can get away with changing the behavior with regards to strings and bytes. 99% of users will only ever work with valid multibyte strings. There may be more edge cases that we need to work around in the future, but I think that effort is smaller than the effort of reimplementing all text processing to handle a unique encoding. Hopefully this is the right trade-off. We will try to use byte arrays throughout the code when valid Unicode is not needed.
Buffers will also need to come in two flavors: a UTF-8 one and a raw-byte version. Alternatively, we could use the same buffer type but convert all non-ASCII bytes to their equivalent codepoints; we would then need to handle those codepoints specially in search.
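The first option (two buffer flavors) might look like this sketch; the names are hypothetical, and real buffers would use gap-buffer storage rather than a plain `String`:

```rust
/// Two buffer flavors: text buffers hold valid UTF-8, buffers
/// visiting binary data hold raw bytes with no decoding.
enum BufferText {
    Utf8(String),
    Raw(Vec<u8>),
}

impl BufferText {
    /// Length in bytes, the same in both flavors.
    fn byte_len(&self) -> usize {
        match self {
            BufferText::Utf8(s) => s.len(),
            BufferText::Raw(v) => v.len(),
        }
    }

    /// Length in characters; in a raw buffer every byte is one "char",
    /// so the two lengths only diverge for multibyte text.
    fn char_len(&self) -> usize {
        match self {
            BufferText::Utf8(s) => s.chars().count(),
            BufferText::Raw(v) => v.len(),
        }
    }
}
```

Operations like search would dispatch on the flavor the same way, which is where the special handling mentioned above would live.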