Visualization tables lack descriptions

Open Mingun opened this issue 1 year ago • 2 comments

  1. It is unclear how to read "BMP coverage". The page should have a few introductory words explaining what it is
  2. Some of the information shown is unclear and not described anywhere, for example, the numbers at the bottom of each cell: Big5-bmp https://encoding.spec.whatwg.org/big5-bmp.html
  3. The colormap (legend) would be better included on the page with the table itself, with a tooltip on each cell showing the info from the legend
  4. Why is it important to encode information like "Two bytes in UTF-8, code point follows immediately the code point of previous pointer" in the color? What's so important about "follows immediately" that it has its own color?
  5. The "BMP coverage" table and the other table have different cell layouts: in "BMP coverage" the code point is above the symbol and an unknown number is below, while the other table is the opposite; a number (it looks like a cell number here, but it is not a cell number in BMP) is shown in all cells in the other table, but only in mapped cells in "BMP coverage"
  6. Why do the cell coordinates in the other table not start from zero? What do those numbers mean? Big5 https://encoding.spec.whatwg.org/big5.html

Mingun Aug 21 '22 08:08

Some of the information shown is unclear and not described anywhere, for example, the numbers at the bottom of each cell

That's the index number as used by the spec.

Why is it important to encode information like "Two bytes in UTF-8, code point follows immediately the code point of previous pointer" in the color? What's so important about "follows immediately" that it has its own color?

These help with understanding what optimizations can be made in an implementation. If the number of UTF-8 bytes is known, there's no need to branch to decide how many UTF-8 bytes are needed. "Follows immediately" is important for understanding contiguous ranges that could be arranged in data structures as such.
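
To make the two properties concrete, here is a minimal Python sketch. The names `utf8_length` and `contiguous_runs` and the toy index are made up for illustration and are not taken from the spec or the visualization script. An index is modelled as a list where the position is the pointer and the value is a code point, or `None` if unmapped.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 needs for a Unicode scalar value."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def contiguous_runs(index):
    """Group pointers whose code points follow each other immediately.

    Returns a list of (first_pointer, first_code_point, length) tuples.
    """
    runs = []
    prev = None  # code point at the previous pointer, None if unmapped
    for pointer, cp in enumerate(index):
        if cp is not None:
            if prev is not None and cp == prev + 1:
                # "Follows immediately": extend the current run.
                first_pointer, first_cp, length = runs[-1]
                runs[-1] = (first_pointer, first_cp, length + 1)
            else:
                runs.append((pointer, cp, 1))
        prev = cp
    return runs


# Hypothetical toy index, not real data from indexes.json.
toy_index = [0x410, 0x411, 0x412, None, 0x2591, 0x2592, 0x401]
runs = contiguous_runs(toy_index)
```

A run can be stored as a single (start pointer, start code point, length) triple instead of one table entry per pointer, and if every code point in a run has the same UTF-8 length, the encoder needs no per-character branch on length.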

The "BMP coverage" table and the other table have different cell layouts

The number according to which the table is ordered is at the top and the other number is at the bottom. Hence, for the BMP coverage, the Unicode scalar value is at the top and for the index visualization the index number is at the top.

Why do the cell coordinates in the other table not start from zero? What do those numbers mean?

These are hexadecimal encoded byte values.

hsivonen Aug 23 '22 07:08

I actually found answers to some of my questions myself while working through indexes.json, but it would be nice to have them answered on the visualization pages:

"BMP coverage" pages (such as https://encoding.spec.whatwg.org/big5-bmp.html) contain a 256 x 256 table with the following information:

  • The external header row (00 01 02 ...) is the low byte of the code point (U+__XX) in decimal form
  • The internal header row contains the same in hexadecimal
  • The external header column (00 01 02 ...) is the high byte of the code point (U+XX__) in decimal form
  • The internal header column contains the same in hexadecimal
  • Each cell contains
    • The code point value at the top, in the form U+xxxx
    • The glyph in the middle, or the glyph for U+FFFD (REPLACEMENT CHARACTER) if the cell has no mapped code point
    • At the bottom, the position in the index (the array index in the JSON array of code points), which the specification calls the pointer

The table represents the 256 x 256 = 65536 (0x10000) code points of the Basic Multilingual Plane, U+0000 to U+FFFF (who would doubt).
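
As a rough illustration of how a cell position relates to a code point in this layout, here is a small Python sketch. The function names are hypothetical. It assumes, as described above, that the header column selects the high byte and the header row selects the low byte.

```python
def bmp_cell_to_code_point(high_byte, low_byte):
    """Combine the header-column value (high byte) and the
    header-row value (low byte) into a BMP code point."""
    if not (0 <= high_byte <= 0xFF and 0 <= low_byte <= 0xFF):
        raise ValueError("both coordinates must fit in one byte")
    return (high_byte << 8) | low_byte


def code_point_to_bmp_cell(code_point):
    """Inverse mapping: split a BMP code point into (high, low) bytes."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("not a BMP code point")
    return code_point >> 8, code_point & 0xFF


def cell_label(code_point):
    """Label in the U+xxxx form shown at the top of each cell."""
    return f"U+{code_point:04X}"
```

For example, the cell at high byte 0x4E and low byte 0x00 corresponds to U+4E00.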


"Index" pages contain tables whose structure varies slightly with the encoding, with the following information:

  • Each cell contains
    • At the top, the position in the index (the array index in the JSON array of code points), which the specification calls the pointer
    • The glyph in the middle, or the glyph for U+FFFD (REPLACEMENT CHARACTER) if the cell has no mapped code point
    • The code point value at the bottom, in the form U+xxxx, if the cell represents a mapped value
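
The three parts of an index cell can be sketched in Python as follows. This is only an illustration: `index_cell` is a made-up helper, and the index is assumed to be a list of code points with `None` for unmapped pointers, which is how a JSON array with `null` entries loads in Python.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def index_cell(index, pointer):
    """Return (pointer, glyph, code point label) for one cell.

    The label is None when the pointer has no mapped code point,
    matching the cells that show only U+FFFD as a glyph.
    """
    code_point = index[pointer]
    if code_point is None:
        return (pointer, REPLACEMENT_CHARACTER, None)
    return (pointer, chr(code_point), f"U+{code_point:04X}")


# Hypothetical two-entry index: one mapped pointer, one unmapped.
toy_index = [0x0410, None]
mapped = index_cell(toy_index, 0)
unmapped = index_cell(toy_index, 1)
```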

Single-byte encodings (such as https://encoding.spec.whatwg.org/ibm866.html) show only the upper half of the encoding, because they are all ASCII-compatible and entries 00-7F are the same as in ASCII. The table is therefore always 16 x 8 and represents bytes 80-FF:

  • The header row (00 01 02 ...) is the low nibble of an encoded byte (0x_X) in hexadecimal form
  • The header column (08 09 0A ...) is the high nibble of an encoded byte (0xX_) in hexadecimal form
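
A short Python sketch of how a byte maps onto this 16 x 8 grid. The helper `single_byte_cell` is hypothetical. It assumes the pointer for a single-byte encoding is the byte minus 0x80, which is consistent with the table covering only bytes 80-FF.

```python
def single_byte_cell(byte):
    """Locate a non-ASCII byte in the 16 x 8 single-byte table."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80-0xFF appear in the table")
    return {
        "row_header": f"{byte >> 4:02X}",       # high nibble: 08 .. 0F
        "column_header": f"{byte & 0x0F:02X}",  # low nibble: 00 .. 0F
        "pointer": byte - 0x80,                 # position in the index
    }
```

For instance, byte 0x80 lands in row 08, column 00, at pointer 0, and byte 0xFF lands in row 0F, column 0F, at pointer 127.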

Multi-byte encodings (such as https://encoding.spec.whatwg.org/big5.html) are more complicated. All such encodings that are visualized occupy 1 or 2 bytes per code point. In most cases only ASCII code points occupy 1 byte, so they are not included in the visualization; the other code points occupy two bytes:

  • The external header row (grey) is the low byte of the encoded code point (__XX) in hexadecimal form
  • The internal header row (white) is just a row index in hexadecimal form
  • The external header column (grey) is the high byte of the encoded code point (XX__) in hexadecimal form
  • The internal header column (white) is just a column index in hexadecimal form

The table dimensions depend on the encoding and reflect the constants used in the encoding process.
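
For Big5, for example, the constant is 157, the number of valid trail bytes per lead byte. The sketch below follows my reading of the Big5 decoder in the Encoding Standard. The function name and the `BIG5_TRAIL_COUNT` constant name are my own, and error handling is reduced to returning `None`.

```python
BIG5_TRAIL_COUNT = 157  # 63 trail bytes in 0x40-0x7E plus 94 in 0xA1-0xFE


def big5_pointer(lead, trail):
    """Compute the Big5 index pointer for a lead/trail byte pair.

    Returns None if the pair cannot form a pointer.
    """
    if not 0x81 <= lead <= 0xFE:
        return None
    if 0x40 <= trail <= 0x7E:
        offset = 0x40
    elif 0xA1 <= trail <= 0xFE:
        offset = 0x62  # closes the gap between 0x7E and 0xA1
    else:
        return None
    return (lead - 0x81) * BIG5_TRAIL_COUNT + (trail - offset)
```

This also explains why the coordinates do not start from zero: the first lead byte is 0x81 and the first trail byte is 0x40, so the pair 81 40 is pointer 0 and 82 40 is pointer 157.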

Mingun Aug 23 '22 16:08