Why does the Big5 index contain unmappable characters?

Open Mingun opened this issue 1 year ago • 2 comments

I am trying to generate every character that a particular encoding supports, in order to produce test files for quick-xml. Using the encoding_rs crate, I found that some code points declared for the Big5 encoding in https://github.com/whatwg/encoding/blob/main/indexes.json are actually emitted as HTML numeric character references (&#...;). Digging into that, I realized that such output is generated when a character is unmappable by the encoding.

So the question is: what is the rationale for including in the index characters that are unmappable by the encoding? I cannot find the answer on https://encoding.spec.whatwg.org/. It describes how to deal with this strange index, but does not explain why the index is so strange.

Mingun, Aug 21 '22 09:08

The Big5 encoder and decoder are asymmetric (like the EUC-JP encoder and decoder). The visualizations visualize what can be decoded. The spec excludes part of the decoding space from round-tripping via the encoder in order for HTML form submission not to generate extension-range bytes that some server-side recipients may not support.

For EUC-JP, the asymmetry is based on historical experience. For Big5, it is a prudent analogy with the problem initially seen with EUC-JP. Also, the Big5 exclusion is questionable and, possibly by accident, excludes less than was intended: the encoder only excludes the extension part below the original Big5 range but doesn't exclude the other extension part above the original Big5 range.
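
To make the asymmetry concrete, here is a minimal Rust sketch of the byte/pointer arithmetic as I read it in the Encoding Standard's Big5 decoder and encoder. The function and constant names are hypothetical, and the sketch deliberately leaves out the actual index lookup (pointer to code point and back), so it only shows which pointers the encoder refuses to use, not which code points exist.

```rust
/// Number of trail-byte slots per lead byte in the Big5 index.
const TRAILS_PER_LEAD: u16 = 157;

/// Lowest pointer the encoder will use. Pointers below this value belong to
/// lead bytes 0x81..=0xA0, the extension area below the original Big5 range.
const FIRST_ENCODABLE_POINTER: u16 = (0xA1 - 0x81) * TRAILS_PER_LEAD;

/// Decoder direction: map a lead/trail byte pair to an index pointer.
/// Returns None when either byte is outside the ranges the decoder accepts.
/// (Whether the pointer has an entry in the index is not checked here.)
fn pointer_from_bytes(lead: u8, trail: u8) -> Option<u16> {
    if !(0x81..=0xFE).contains(&lead) {
        return None;
    }
    let offset: u8 = match trail {
        0x40..=0x7E => 0x40,
        0xA1..=0xFE => 0x62,
        _ => return None,
    };
    Some((lead as u16 - 0x81) * TRAILS_PER_LEAD + (trail - offset) as u16)
}

/// Encoder direction: map a pointer back to bytes, refusing the pointers
/// that the encoder excludes from its view of the index.
fn bytes_from_pointer(pointer: u16) -> Option<(u8, u8)> {
    if pointer < FIRST_ENCODABLE_POINTER {
        // May be decodable, but the encoder never produces these bytes.
        return None;
    }
    let lead = pointer / TRAILS_PER_LEAD + 0x81;
    if lead > 0xFE {
        return None;
    }
    let trail = pointer % TRAILS_PER_LEAD;
    let offset = if trail < 0x3F { 0x40 } else { 0x62 };
    Some((lead as u8, (trail + offset) as u8))
}
```

Under these assumptions, a byte pair with a lead byte below 0xA1 still yields a pointer in the decoder direction, while the encoder direction rejects that same pointer. That is the case where a code point listed in indexes.json ends up being treated as unmappable on output.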

hsivonen, Aug 23 '22 08:08

Well, that information should probably be included somewhere in the spec, perhaps here: https://github.com/whatwg/encoding/blob/4f549cd26fd5f6a8f8bd8fc3fede519515cdea4f/encoding.bs#L959-L961, because it was a little surprising when I used indexes.json for my own purposes.

Mingun, Aug 23 '22 15:08