encoding
Visualization tables lack descriptions
- It is unclear how to read "BMP coverage". It would be worth adding a few introductory words on the page explaining what it is.
- Some of the information shown is unclear and not described anywhere, for example the numbers at the bottom of each cell:
https://encoding.spec.whatwg.org/big5-bmp.html
- The color map (legend) would be better included on the same page as the table itself, with a tooltip on each cell showing the information from the legend.
- Why is it important to encode information like "Two bytes in UTF-8, code point follows immediately the code point of previous pointer" in the color? What is so important about "follows immediately" that it has its own color?
- The "BMP coverage" table and the other table have different cell layouts: in "BMP coverage" the code point is above the symbol and an unexplained number is below it, while the other table is the opposite. A number (it looks like a cell number there, but it is not a cell number in "BMP coverage") is shown in all cells of the other table, but only in mapped cells in "BMP coverage".
- Why do the cell coordinates in the other table not start from zero? What do those numbers mean?
https://encoding.spec.whatwg.org/big5.html
Some of the information shown is unclear and not described anywhere, for example the numbers at the bottom of each cell
That's the index number as used by the spec.
Why is it important to encode information like "Two bytes in UTF-8, code point follows immediately the code point of previous pointer" in the color? What is so important about "follows immediately" that it has its own color?
These help with understanding what optimizations can be made in implementation. If the number of UTF-8 bytes is known, there's no need to branch to decide how many UTF-8 bytes are needed. "Follows immediately" is important for understanding contiguous ranges that could be arranged in data structures as such.
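To make the two properties concrete, here is a minimal sketch of what the colors encode. It assumes an index is given as a list of code points (with `None` for unmapped pointers), as in `indexes.json`; the function names `utf8_length` and `contiguous_runs` are hypothetical and not taken from the spec or its tooling.

```python
from typing import List, Optional, Tuple


def utf8_length(code_point: int) -> int:
    """Number of bytes needed to encode a code point in UTF-8."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def contiguous_runs(index: List[Optional[int]]) -> List[Tuple[int, int, int]]:
    """Group an index into runs where each code point immediately
    follows the code point of the previous pointer.

    Returns (first pointer, first code point, run length) tuples.
    """
    runs = []
    prev = None
    for pointer, cp in enumerate(index):
        if cp is not None and prev is not None and cp == prev + 1:
            runs[-1][2] += 1  # extends the current run
        elif cp is not None:
            runs.append([pointer, cp, 1])  # starts a new run
        prev = cp
    return [tuple(run) for run in runs]


# Made-up sample data, not a real index.
sample = [0x0410, 0x0411, 0x0412, None, 0x2591, 0x2592, 0x00A0]
runs = contiguous_runs(sample)
```

A run of length greater than one could, for example, be stored as a start value plus a length instead of one entry per pointer, and a run whose code points all share the same `utf8_length` needs no per-character branch when converting to UTF-8.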
The "BMP coverage" table and the other table have different cell layouts
The number according to which the table is ordered is at the top and the other number is at the bottom. Hence, for the BMP coverage, the Unicode scalar value is at the top and for the index visualization the index number is at the top.
Why do the cell coordinates in the other table not start from zero? What do those numbers mean?
These are hexadecimal encoded byte values.
I actually found answers to some of my questions myself while working through indexes.json, but it would be nice to have them answered on the visualization pages:
"BMP coverage" pages (such as https://encoding.spec.whatwg.org/big5-bmp.html) contain a 256 x 256 table with the following information:
- The external header row (`00 01 02 ...`) is the low byte of the code point (`U+__XX`) in decimal form. The internal header row contains the same value in hexadecimal.
- The external header column (`00 01 02 ...`) is the high byte of the code point (`U+XX__`) in decimal form. The internal header column contains the same value in hexadecimal.
- Each cell contains:
  - The code point value at the top, in the form `U+xxxx`
  - The glyph in the middle, or the glyph for `U+FFFD` (replacement character) if the cell does not contain a mapped code point
  - The position in the index (the array index in the JSON array of code points), which the specification calls the pointer, at the bottom
The table represents the 256 x 256 = 65536 (`0x10000`) code points of the Basic Multilingual Plane, `U+0000` through `U+FFFF` (who would doubt).
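As a rough sketch of how a BMP coverage cell could be derived from the description above: the row is the high byte of the code point, the column is the low byte, and the number at the bottom comes from inverting the index. The function names are hypothetical, and choosing the first pointer when a code point appears more than once in an index is an assumption, not something the pages state.

```python
from typing import Dict, List, Optional, Tuple


def bmp_cell(code_point: int) -> Tuple[int, int]:
    """Return (row, column) of a BMP code point in the 256 x 256 grid."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("not a BMP code point")
    return code_point >> 8, code_point & 0xFF


def invert_index(index: List[Optional[int]]) -> Dict[int, int]:
    """Map each code point to a pointer (array position) in the index.

    Assumption: when a code point occurs more than once,
    the first pointer is kept.
    """
    pointers: Dict[int, int] = {}
    for pointer, cp in enumerate(index):
        if cp is not None and cp not in pointers:
            pointers[cp] = pointer
    return pointers


row, column = bmp_cell(0x4E2D)  # U+4E2D sits in row 0x4E, column 0x2D
```

Cells whose code point is missing from the inverted index would be the ones drawn with `U+FFFD` and no number at the bottom.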
"Index" pages contain tables whose structure differs slightly depending on the encoding, with the following information:
- Each cell contains:
  - The position in the index (the array index in the JSON array of code points), which the specification calls the pointer, at the top
  - The glyph in the middle, or the glyph for `U+FFFD` (replacement character) if the cell does not contain a mapped code point
  - The code point value at the bottom, in the form `U+xxxx`, if the cell represents a mapped value
Single-byte encodings (such as https://encoding.spec.whatwg.org/ibm866.html) show only the upper half of the encoding (they are all ASCII-compatible, so entries `00`-`7F` are the same as in ASCII), so the table is always 16 x 8 and represents bytes `80`-`FF`:
- The header row (`00 01 02 ...`) is the low nibble of the encoded byte (`0x_X`) in hexadecimal form.
- The header column (`08 09 0A ...`) is the high nibble of the encoded byte (`0xX_`) in hexadecimal form.
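For single-byte encodings the relationship between a byte, its cell, and its pointer is simple enough to sketch. The snippet assumes, as the Encoding Standard's single-byte decoder does, that the pointer is the byte value minus `0x80`; the function name `single_byte_cell` is hypothetical.

```python
from typing import Tuple


def single_byte_cell(byte: int) -> Tuple[int, int, int]:
    """Return (header column value, header row value, pointer) for a
    non-ASCII byte of a single-byte encoding."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80-0xFF are shown in the table")
    high_nibble = byte >> 4    # selects the table row (08 .. 0F)
    low_nibble = byte & 0x0F   # selects the table column (00 .. 0F)
    pointer = byte - 0x80      # position in the 128-entry index
    return high_nibble, low_nibble, pointer


cell = single_byte_cell(0xA5)
```

Here byte `0xA5` lands in the row labelled `0A` and the column labelled `05`, with pointer 37.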
Multi-byte encodings (such as https://encoding.spec.whatwg.org/big5.html) are more complicated. All such encodings that are visualized occupy 1 or 2 bytes per code point. In most cases only ASCII code points occupy 1 byte, so they are not included in the visualization; other code points occupy two bytes:
- The external header row (grey) is the low byte of the code point in the encoding (`__XX`) in hexadecimal form. The internal header row (white) is just a row index in hexadecimal form.
- The external header column (grey) is the high byte of the code point in the encoding (`XX__`) in hexadecimal form. The internal header column (white) is just a column index in hexadecimal form.
The table dimensions depend on the encoding and represent constants that are used in the encoding process.
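As an illustration of such constants, here is a sketch of the pointer computation for Big5, following my reading of the Big5 decoder in the Encoding Standard: lead bytes run from `0x81` to `0xFE`, and each lead byte covers 157 pointers (63 trail bytes in `0x40`-`0x7E` plus 94 in `0xA1`-`0xFE`). The function name `big5_pointer` is hypothetical, and other multi-byte encodings use different constants.

```python
from typing import Optional


def big5_pointer(lead: int, trail: int) -> Optional[int]:
    """Compute the index pointer for a two-byte Big5 sequence,
    or None if the bytes are outside the valid ranges."""
    if not 0x81 <= lead <= 0xFE:
        return None
    if 0x40 <= trail <= 0x7E:
        offset = 0x40
    elif 0xA1 <= trail <= 0xFE:
        offset = 0x62  # skips the unused gap 0x7F-0xA0
    else:
        return None
    return (lead - 0x81) * 157 + (trail - offset)


first = big5_pointer(0x81, 0x40)     # first cell of the first row
next_row = big5_pointer(0x82, 0x40)  # first cell of the second row
```

This is why the cell coordinates do not start from zero: the grey headers are the actual byte values, while the pointer is the zero-based position derived from them.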