rune
Unified string type
Emacs has a unique scheme for representing strings.
character code     1st byte   byte sequence
--------------     --------   -------------
     0-7F          00..7F     0xxxxxxx
    80-7FF         C2..DF     110yyyyx 10xxxxxx
   800-FFFF        E0..EF     1110yyyy 10yxxxxx 10xxxxxx
 10000-1FFFFF      F0..F7     11110yyy 10yyxxxx 10xxxxxx 10xxxxxx
200000-3FFF7F      F8         11111000 1000yxxx 10xxxxxx 10xxxxxx 10xxxxxx
3FFF80-3FFFFF      C0..C1     1100000x 10xxxxxx (for eight-bit-char)
400000-...         invalid

invalid 1st byte   80..BF     10xxxxxx
                   F9..FF     11111yyy
In each bit pattern, 'x' and 'y' each represent a single bit of the
character code payload, and at least one 'y' must be a 1 bit.
In the 5-byte sequence, the 22-bit payload cannot exceed 3FFF7F.
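The table above can be sketched as an encoder. This is a minimal illustration, not rune's actual API; the function name is mine:

```rust
/// Encode an Emacs character code into its internal byte sequence,
/// following the table above. Returns None for invalid codes.
fn encode_emacs_char(c: u32) -> Option<Vec<u8>> {
    match c {
        0x00..=0x7F => Some(vec![c as u8]),
        0x80..=0x7FF => Some(vec![
            0xC0 | (c >> 6) as u8,
            0x80 | (c & 0x3F) as u8,
        ]),
        0x800..=0xFFFF => Some(vec![
            0xE0 | (c >> 12) as u8,
            0x80 | ((c >> 6) & 0x3F) as u8,
            0x80 | (c & 0x3F) as u8,
        ]),
        0x10000..=0x1FFFFF => Some(vec![
            0xF0 | (c >> 18) as u8,
            0x80 | ((c >> 12) & 0x3F) as u8,
            0x80 | ((c >> 6) & 0x3F) as u8,
            0x80 | (c & 0x3F) as u8,
        ]),
        0x200000..=0x3FFF7F => Some(vec![
            0xF8,
            0x80 | ((c >> 18) & 0x0F) as u8,
            0x80 | ((c >> 12) & 0x3F) as u8,
            0x80 | ((c >> 6) & 0x3F) as u8,
            0x80 | (c & 0x3F) as u8,
        ]),
        // Raw eight-bit bytes use the short 2-byte C0/C1 form,
        // not the 5-byte form their magnitude would suggest.
        0x3FFF80..=0x3FFFFF => {
            let b = (c & 0xFF) as u8; // low byte is the original raw byte
            Some(vec![0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)])
        }
        _ => None, // 0x400000 and above are invalid
    }
}
```

Note how the last arm must come before the fallthrough but after the 5-byte range, mirroring the carve-out in the table.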
Raw 8-bit bytes are represented by codepoints 0x3FFF80 to 0x3FFFFF. However, in the UTF-8-like encoding, where they would otherwise get a 5-byte sequence starting with 0xF8, they are instead represented by a 2-byte sequence starting with 0xC0 or 0xC1. These 2-byte sequences are disallowed in UTF-8 because they would form a duplicate (overlong) encoding of the 1-byte ASCII range.
Raw bytes in the ASCII range (0-127) are stored as themselves. Bytes above 127 are encoded using the extended codepoints described above. These extended codepoints don't follow the normal rules: they occupy two bytes, reusing the lead bytes 0xC0 and 0xC1 that sit unused between the 1-byte and 2-byte ranges. For example, to encode 137 (#o211 #x89) as a raw byte, the codepoint is 0x3FFF89. Notice that the low byte of the codepoint is the original raw byte. It is laid out in memory like this:
- Original binary :: 1000 1001 (0x89)
- remove the eighth bit :: 000 1001
- encode using the table above :: 1100_0000 1000_1001
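The worked example above can be checked with a short sketch. The function names here are hypothetical, modeled on Emacs's `BYTE8_TO_CHAR`/`BYTE8_STRING` macros, not rune's API:

```rust
/// Map a raw byte (>= 0x80) to its extended codepoint: the raw byte
/// becomes the low byte of a codepoint in 0x3FFF80..=0x3FFFFF.
fn byte8_to_char(b: u8) -> u32 {
    debug_assert!(b >= 0x80, "ASCII bytes are stored as themselves");
    0x3FFF00 + b as u32
}

/// Encode a raw byte into its 2-byte memory layout: drop the eighth
/// bit, then fill the 1100000x 10xxxxxx pattern from the table.
fn byte8_encode(b: u8) -> [u8; 2] {
    [0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)]
}
```

For 0x89 this reproduces the steps above: codepoint 0x3FFF89, bytes 0xC0 0x89.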
This encoding scheme is clever and flexible, but it is unique to Emacs. It means you can't reuse any string-processing libraries that expect Unicode, because Emacs supports a superset of Unicode. We would like to avoid this limitation if at all possible.
Currently the plan is to have two types of string: unibyte and multibyte. Unibyte strings are raw byte arrays ([u8]) and multibyte strings are valid UTF-8 (str). There is no concept of a "raw byte" in a multibyte string; adding one will automatically convert the string to unibyte.
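A minimal sketch of this two-flavor design, with hypothetical names (rune's real type will differ):

```rust
/// The two planned string flavors.
enum LispString {
    /// Raw byte array; no encoding guarantee.
    Unibyte(Vec<u8>),
    /// Guaranteed valid UTF-8; ordinary string libraries apply.
    Multibyte(String),
}

impl LispString {
    /// Pushing a raw byte onto a multibyte string converts it to
    /// unibyte, since multibyte strings must stay valid UTF-8.
    fn push_raw_byte(self, b: u8) -> LispString {
        match self {
            LispString::Multibyte(s) => {
                let mut bytes = s.into_bytes();
                bytes.push(b);
                LispString::Unibyte(bytes)
            }
            LispString::Unibyte(mut v) => {
                v.push(b);
                LispString::Unibyte(v)
            }
        }
    }
}
```

The key design choice is that the conversion is one-way and automatic: multibyte never silently holds invalid UTF-8.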
So far the only places I have seen unibyte strings used are in the byte compiler (for opcodes) and when viewing non-text files (like a binary). Given this, I think we can get away with changing the behavior with regards to strings and bytes. 99% of users will only ever work with valid multibyte strings. There may be more edge cases that we need to work around in the future, but I think that effort is smaller than the effort of reimplementing all text processing to handle a unique encoding. Hopefully this is the right trade-off. We will try to use byte arrays throughout the code when valid Unicode is not needed.
Buffers will also need to come in two flavors: a UTF-8 one and a raw-byte version. Alternatively, we could use the same buffer type but convert all non-ASCII bytes to their equivalent codepoints; we would then need to handle those codepoints specially in search.
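The first option (two buffer flavors) might look like this sketch; the names are hypothetical, and real buffers would use gap-buffer storage rather than a plain `String`:

```rust
/// Two buffer flavors: text buffers hold valid UTF-8, buffers
/// visiting binary data hold raw bytes with no decoding.
enum BufferText {
    Utf8(String),
    Raw(Vec<u8>),
}

impl BufferText {
    /// Length in bytes, the same in both flavors.
    fn byte_len(&self) -> usize {
        match self {
            BufferText::Utf8(s) => s.len(),
            BufferText::Raw(v) => v.len(),
        }
    }

    /// Length in characters; in a raw buffer every byte is one "char",
    /// so the two lengths only diverge for multibyte text.
    fn char_len(&self) -> usize {
        match self {
            BufferText::Utf8(s) => s.chars().count(),
            BufferText::Raw(v) => v.len(),
        }
    }
}
```

Operations like search would dispatch on the flavor the same way, which is where the special handling mentioned above would live.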