
Feedback: Unicode is a mess

Open tommai78101 opened this issue 4 years ago • 2 comments

Credits to /u/Ladis_Wascheharuum for providing me constructive feedback.

Sorry, but I don't like this at all. If this is meant to be a primer, you need to introduce the concepts in a way that someone new can understand them. Instead, you jump in deep and kinda go all over the place. Plus there are a few errors that make it more confusing.

> The Unicode Standard defines the information of a Unicode character, namely the Unicode Transformation Formats (UTF),

What? The "information" of a Unicode character would be its code point, class, decomposition, etc. A UTF is not a property (or "information") of any character, it's an encoding format that applies to code points generally.
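To make that distinction concrete, here is a minimal Python sketch (not from the original article) that looks up a few actual properties of a character with the standard `unicodedata` module. The helper name `describe` is made up for illustration. Note that nothing in the result mentions UTF-8 or UTF-16: those only enter the picture when the code point is serialized.

```python
import unicodedata

def describe(ch):
    """Return a few Unicode properties of a single character.

    `describe` is a hypothetical helper; the unicodedata calls are standard.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",                 # the number assigned to the character
        "name": unicodedata.name(ch),                     # official character name
        "category": unicodedata.category(ch),             # general category (class)
        "decomposition": unicodedata.decomposition(ch),   # canonical decomposition mapping
    }

# U+00E9, LATIN SMALL LETTER E WITH ACUTE
info = describe("\u00e9")
for key, value in info.items():
    print(f"{key}: {value}")
```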

> To briefly explain what UCS-2 is, this scheme uses a single “code value” containing one or more “code points” assigned to the “code space” between 0 and 65,535 for each character, and allows 2 bytes, or 1 16-bit word, to represent that value. Thus, the “2” in UCS-2 refers to the “2-byte encoding” scheme.

This is headache-inducing to anyone who isn't already familiar with all these terms. It's also technically wrong. UCS-2 code values correspond directly to code points, one-to-one. The "or more" applies to UTF-16, not UCS-2.
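A rough Python sketch of the difference, assuming the usual definition of UCS-2 as covering only code points 0 through 0xFFFF. Python has no UCS-2 codec, so `fits_in_ucs2` is just a hypothetical range check, and `utf16_code_units` is an illustrative helper built on the standard `utf-16-be` codec.

```python
def utf16_code_units(ch):
    """Count the 16-bit code units UTF-16 uses for one code point."""
    # Each UTF-16 code unit is 2 bytes; the -be variant adds no byte order mark.
    return len(ch.encode("utf-16-be")) // 2

def fits_in_ucs2(ch):
    """UCS-2 maps one code unit to one code point, so only 0..0xFFFF fits."""
    return ord(ch) <= 0xFFFF

for ch in ["A", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}: UTF-16 units = {utf16_code_units(ch)}, "
          f"representable in UCS-2 = {fits_in_ucs2(ch)}")
```

Code points inside the original 16-bit range take one unit in both schemes. Only UTF-16 can go beyond that range, and it does so by spending two units on a single code point.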

Then you have a history lesson about East Asian languages in the UTF-16 section. If you're explaining UTF-16, you should explain it as a method of expressing more code points in 16-bit code units. Save the history lesson for another section, talking about how the code space was expanded because the original 65K was deemed inadequate.
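For what "expressing more code points in 16-bit code units" means in practice, a short sketch of the surrogate pair arithmetic may help. The function name `to_surrogate_pair` is made up for illustration; the constants follow the standard UTF-16 definition, and the last lines cross-check the result against Python's built-in codec.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go in the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")

# Cross-check against the built-in UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001F600".encode("utf-16-be")
builtin_pair = (int.from_bytes(encoded[:2], "big"), int.from_bytes(encoded[2:], "big"))
```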

Okay:

In general, you need to lay this out so that each section introduces a solid concept, then each following section builds on that knowledge. The way I'd do it is:

  • A short history of characters (ASCII, extended ASCII, code pages). Seriously, keep it brief.
  • Unicode invented as a way to encode all characters in all languages. (Define "Unicode character" and "abstract character" here.) Each character is assigned a code point. Code points are stable. Code points are just numbers.
  • Mention code space; original 16-bit, then expanded.
  • Introduce UTFs as a means of encoding code points in binary data. (Mention UCS-2 was designed for the original historic 16-bit code space.) Talk about pros and cons of each, get into technical explanations of each one here.

The big mistake people explaining Unicode make is trying to explain UTFs before explaining code points. Code points are just numbers: a stable index of characters from all written languages. This is the heart of Unicode and the most important thing. UTFs are just ways of storing code points in data.
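As a small illustration of that last point, the Python sketch below takes one code point and stores it three ways. The helper name `encodings_of` is hypothetical; the codecs are Python's standard ones. The number stays the same, only the bytes differ, and decoding any of them gives back the same code point.

```python
def encodings_of(ch):
    """Show one code point serialized by three different UTFs (as hex strings)."""
    return {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
        "utf-32-be": ch.encode("utf-32-be").hex(),
    }

euro = "\u20ac"                      # EURO SIGN
print(f"code point: U+{ord(euro):04X} ({ord(euro)})")
result = encodings_of(euro)
for name, hex_bytes in result.items():
    print(f"{name}: {hex_bytes}")

# Every encoding decodes back to the same single code point.
round_trips = [bytes.fromhex(h).decode(name) for name, h in result.items()]
```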

tommai78101 · May 31 '20 04:05