unipdf internal IdentityEncoder should be more clear with rune handling

internal IdentityEncoder should be more clear with rune handling

Open gunnsth opened this issue 4 years ago • 0 comments

The IdentityEncoder is used to represent Identity-H and Identity-V encodings, that are used to map 2-byte character codes to 2-byte CIDs:

The horizontal identity mapping for 2-byte CIDs; may be used with CIDFonts
using any Registry, Ordering, and Supplement values. It maps 2-byte character
codes ranging from 0 to 65,535 to the same 2-byte CID value, interpreted highorder
byte first.

When used with TrueType CID fonts, the CID values typically map directly to GID (glyph indices), where the CID value does not have any unicode meaning. Thus it can be confusing that it implements the TextEncoder interface, having methods such as CharcodeToRune where it is returning a "rune" that is not actually the utf-8 rune but just the integer value of the CID... This is confusing and can easily lead to problems.

We probably need to clarify the terminology and maybe split the TextEncoder interface up. The Identity-H should just map bytes to CIDs and such. If a CIDToGIDMap is defined that also needs to be used.

May 23 '20 11:05 gunnsth

unipdf unipdf copied to clipboard

internal IdentityEncoder should be more clear with rune handling

unipdf
unipdf copied to clipboard