LibPDF: Implement most of the spec algorithm for picking TrueType glyphs

Open nico opened this issue 1 year ago • 1 comments

Non-CID-keyed fonts in PDFs have 8-bit codepoints which are mapped from bytes to character names via encoding.

TrueType fonts don't index glyphs by name (Type1 fonts do), so the fix (codified in the spec) was to make a list of all possible glyph names and map those to (16-bit) Unicode values, and then pass those into the TrueType cmap.

(As a fallback, we're supposed to look at the optional names in the font's "post" table. That part isn't implemented here yet.)
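
A rough sketch of the lookup chain described above, in Python for brevity. This is not LibPDF's code: the function name `glyph_id_for_byte`, the tiny tables, and the glyph IDs are made up for illustration, and the real encoding and glyph-name tables are far larger. The "post" fallback is shown only as one plausible reading of the spec, assuming the table's names are listed in glyph-ID order.

```python
# Illustrative, heavily truncated tables (assumptions, not LibPDF data).

# Step 1: the PDF encoding maps an 8-bit code to a glyph name.
ENCODING = {0x41: "A", 0x85: "ellipsis", 0x93: "quotedblleft"}

# Step 2: a glyph-name list maps each name to a 16-bit Unicode value.
GLYPH_LIST = {"A": 0x0041, "ellipsis": 0x2026, "quotedblleft": 0x201C}

# Step 3: the font's cmap maps Unicode values to glyph IDs (IDs are made up).
CMAP = {0x0041: 36, 0x2026: 171}

NOTDEF = 0  # glyph 0 is the "missing glyph" in TrueType fonts


def glyph_id_for_byte(byte, encoding, glyph_list, cmap, post_names=None):
    """Resolve an 8-bit character code to a TrueType glyph ID."""
    name = encoding.get(byte)
    if name is None:
        return NOTDEF

    # Preferred path: glyph name -> Unicode value -> cmap lookup.
    codepoint = glyph_list.get(name)
    if codepoint is not None and codepoint in cmap:
        return cmap[codepoint]

    # Optional fallback: search the names from the font's "post" table,
    # assumed here to be a list indexed by glyph ID.
    if post_names is not None and name in post_names:
        return post_names.index(name)

    return NOTDEF
```

With these toy tables, byte 0x85 resolves through the cmap, while 0x93 (whose Unicode value is missing from the toy cmap) only resolves if a "post" name list is supplied.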

I've had this sitting around locally since Nov 2023, but I thought it was a bit gross. (I made it a little less gross for the PR; it was even more gross locally.) Turns out this is mandated by the spec!

Don't be intimidated by the big diffstat: 4200 of the new lines are generated (and some of the other lines contain a comment explaining how).

Feb 24 '24 01:02 nico

For Latin scripts, this fixes missing quotes and … and so on. For Cyrillic, it makes the difference between mojibake and actual text.

(Five before/after screenshot pairs.)

The missing "post" table fallback takes us from 780 files without issues (78.0%) to 739 files without issues (73.9%), but that new number is probably closer to the truth.

Feb 24 '24 01:02 nico