Proposal - venturing outside the Basic Multilingual Plane
This is a proposal to resolve #98 by improving support for correctly parsing more Unicode characters.
Background
In JavaScript, strings use the UTF-16 encoding. That encoding says:
- A Unicode codepoint (character) is encoded as a single two-byte code unit if it's part of the Basic Multilingual Plane (BMP).
- A codepoint outside the BMP is encoded as a pair of two-byte code units, called a surrogate pair.

The BMP includes all standard symbols of all scripts in active use today. However, there are some things it doesn't include:
- Variant CJK characters
- Unusual CJK characters
- Most emoji, such as 🥔🍠
- Styled mathematical characters
- Characters from some historical scripts, such as Egyptian Hieroglyphs, Linear B, and Ugaritic.
JavaScript Limitations
Due to historical limitations, JavaScript treats every character as a single two-byte code unit. This goes against the UTF-16 specification, but makes writing code easier.
Anyway, it does mean that a surrogate pair is treated as two separate characters. To see this, just write "🥔".length into your browser console. It will output 2.
It gets more complicated than that
Text is hard nowadays. There are several kinds of combining character sequences that look like one symbol but are actually several separate codepoints. In these cases even the Unicode standard says they are multiple characters, yet typesetting software displays them as something like a single unit:
- Combining diacritics, such as ö (o followed by the combining diaeresis U+0308).
- Flag characters, such as 🇨🇭.
- Skin tone-modified emoji, such as 👍🏾.
| Type | Example | `length` |
|---|---|---|
| Combining diacritics | ö | 1 + number of diacritics |
| Flag characters | 🇨🇭 | $2 + 2 = 4$ |
| Skin tone-modified emoji | 👍🏾 | $2 + 2 = 4$ |
| Family emoji | 👨‍👩‍👧 | $2 + 1 + 2 + 1 + 2 = 8$ |
Further reading
- A video I linked above, with a CJK focus: https://www.youtube.com/watch?v=JUFpjJrYW8w
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- That’s all I have right now
Solution
The ability to parse Unicode is important to the library and its users, so we should support parsing as much Unicode as possible. This does mean we have to come up with a definition of "character", since every piece of software seems to define it differently.
This definition will have to be used at every point in the code where reading a specific number of characters is involved. This affects a large number of building blocks, such as `anyCharOf`, `anyChar`, `exactly`, and so on.
To avoid duplicating code for handling this, we should encapsulate the definition of "character" in an object; instead of working directly on the input string, the rest of the code will go through this character object.
Everything is a combinator
This character object is effectively a building-block parser, which means every other parser becomes a combinator parameterized by a root parser that defines "character". However, unlike most combinators, it would be implemented on the state object itself and support two operations:
- Read one character
- Read N characters
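As a rough sketch of what this could look like, here is a minimal state object that delegates both operations to a pluggable character parser. All names here (`ParseState`, `readChar`, `readChars`, `codeUnitParser`) are illustrative, not the library's actual API:

```javascript
// Hypothetical sketch: the parsing state owns the position and delegates
// the definition of "character" to a pluggable character parser.
class ParseState {
  constructor(input, charParser) {
    this.input = input;
    this.pos = 0;
    this.charParser = charParser;
  }
  // Read one character, as defined by the character parser.
  readChar() {
    const r = this.charParser.readChar(this.input, this.pos);
    if (r === null) return null;
    this.pos = r.next;
    return r.value;
  }
  // Read N characters by repeatedly reading one.
  readChars(n) {
    const out = [];
    for (let i = 0; i < n; i++) {
      const c = this.readChar();
      if (c === null) return null;
      out.push(c);
    }
    return out;
  }
}

// A minimal code-unit character parser (the backwards-compatible default):
const codeUnitParser = {
  readChar: (input, pos) =>
    pos < input.length ? { value: input[pos], next: pos + 1 } : null,
};

const state = new ParseState("ab🥔", codeUnitParser);
console.log(state.readChars(2)); // ["a", "b"]
```

Building blocks like `exactly` would then call `readChars(n)` without caring which definition of "character" is in effect.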
There would be a default character parser. At first it would be the standard two-byte JS character parser, for backwards compatibility, but in the future it would be replaced with the full Unicode-aware parser.
However, alternative parsers might define characters in broader terms, such as treating combining sequences or multi-codepoint emoji as single characters.
Comments?
- Can you see any caveats or issues with this proposal?
- Did I miss something?
- Suggestions about the interface or design?
- Do you feel like it would be a good feature?
- Use-cases?