Proposal - venturing outside the Basic Multilingual Plane
This is a proposal to resolve #98 by improving support for correctly parsing more Unicode characters.
Background
In JavaScript, strings use the UTF-16 encoding. That encoding says:
- A Unicode codepoint (character) is encoded as a single two-byte code unit if it's part of the Basic Multilingual Plane (BMP).
- A codepoint outside the BMP is encoded as a pair of two-byte code units, called a surrogate pair.

The BMP includes all standard symbols of all scripts in active use today. However, there are some things it doesn't include:
- Variant CJK characters
- Unusual CJK characters
- Most emoji, such as 🥔🍠
- Styled mathematical characters
- Characters from some historical scripts, such as Egyptian Hieroglyphs, Linear B, and Ugaritic.
JavaScript Limitations
Due to historical limitations, JavaScript treats every character as a single two-byte code unit. This goes against the UTF-16 specification, but makes writing code easier.
Anyway, it does mean that a surrogate pair is treated as two separate characters. To see this, just write "🥔".length into your browser console. It will output 2.
It gets more complicated than that
Text is hard nowadays. There are several kinds of combining character sequences that look like one symbol but are actually several separate codepoints. In these cases even the Unicode standard says they are multiple characters, yet typesetting software displays them as something like a single unit:
- Combining diacritics, such as ö (o followed by the combining diaeresis U+0308).
- Flag characters, such as 🇨🇭.
- Skin tone-modified emoji, such as 👍🏾.
| Type | Example | `length` |
|---|---|---|
| Combining diacritics | ö | 1 + number of diacritics |
| Flag characters | 🇨🇭 | $2 + 2 = 4$ |
| Skin tone-modified emoji | 👍🏾 | $2 + 2 = 4$ |
| Family emoji | 👨‍👩‍👧 | $2 + 1 + 2 + 1 + 2 = 8$ |
Further reading
- A video I linked above, with a CJK focus: https://www.youtube.com/watch?v=JUFpjJrYW8w
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- That’s all I have right now
Solution
The ability to parse Unicode is important to the library and its users, so we should support parsing as much Unicode as possible. This does mean we have to come up with a definition of "character", since every piece of software seems to define it differently.
This definition will have to be used at every point in the code where reading a specific number of characters is involved. This affects a large number of building blocks, such as `anyCharOf`, `anyChar`, `exactly`, and so on.
To avoid duplicating code for handling this, we should encapsulate the definition of "character" in an object; instead of working directly on the input string, the rest of the code will go through this character object.
Everything is a combinator
This character object is effectively a building-block parser, which means every other parser becomes a combinator parameterized by a root parser that defines "character". However, unlike most combinators, it would be implemented on the state object itself and support two operations:
- Read one character
- Read N characters
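As a rough sketch of what this could look like, here is a minimal state object that delegates both operations to a pluggable character parser. All names here (`ParseState`, `readChar`, `readChars`, `codeUnitParser`) are illustrative, not the library's actual API:

```javascript
// Hypothetical sketch: the parsing state owns the position and delegates
// the definition of "character" to a pluggable character parser.
class ParseState {
  constructor(input, charParser) {
    this.input = input;
    this.pos = 0;
    this.charParser = charParser;
  }
  // Read one character, as defined by the character parser.
  readChar() {
    const r = this.charParser.readChar(this.input, this.pos);
    if (r === null) return null;
    this.pos = r.next;
    return r.value;
  }
  // Read N characters by repeatedly reading one.
  readChars(n) {
    const out = [];
    for (let i = 0; i < n; i++) {
      const c = this.readChar();
      if (c === null) return null;
      out.push(c);
    }
    return out;
  }
}

// A minimal code-unit character parser (the backwards-compatible default):
const codeUnitParser = {
  readChar: (input, pos) =>
    pos < input.length ? { value: input[pos], next: pos + 1 } : null,
};

const state = new ParseState("ab🥔", codeUnitParser);
console.log(state.readChars(2)); // ["a", "b"]
```

Building blocks like `exactly` would then call `readChars(n)` without caring which definition of "character" is in effect.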
There would be a default character parser. At first it would be the standard two-byte JS character parser, for backwards compatibility, but in the future it would be replaced with the full Unicode-aware parser.
However, alternative parsers might define characters in broader terms, such as treating combining sequences or multi-codepoint emoji as single characters.
Comments?
- Can you see any caveats or issues with this proposal?
- Did I miss something?
- Suggestions about the interface or design?
- Do you feel like it would be a good feature?
- Use-cases?