Convert VNI fonts in PDF.js on copy, find, etc.

Open 1ec5 opened this issue 11 years ago • 0 comments

Vietnamese text in PDFs is usually typeset in non-Unicode fonts that use the VNI, VPS, ABC, or TCVN3 layouts. PDF.js renders this text fine, but the underlying character codes aren’t Unicode, so whatever you find or copy comes out as a mangled mess. Because AVIM specializes in Vietnamese input tools, it’s uniquely suited to detecting legacy-encoded Vietnamese text and converting it on the fly when finding or copying inside a PDF.
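
To make the conversion step concrete, here is a minimal sketch of a table-driven converter. The function name `convertVni` and the table `VNI_TO_UNICODE` are hypothetical, and the three entries are only an illustrative subset of the VNI layout as I understand it (base letter followed by a diacritic character); a complete table would need to be checked against the actual fonts.

```typescript
// Illustrative subset only: a real VNI table has far more entries.
// Each pair is [legacy VNI sequence, Unicode replacement].
const VNI_TO_UNICODE: ReadonlyArray<[string, string]> = [
  ["eá", "ế"],
  ["eä", "ệ"],
  ["ñ", "đ"],
];

// Hypothetical helper: scans left to right and replaces the longest
// matching legacy sequence at each position; anything else is copied as-is.
function convertVni(text: string): string {
  const pairs = [...VNI_TO_UNICODE].sort((a, b) => b[0].length - a[0].length);
  let out = "";
  let i = 0;
  while (i < text.length) {
    const hit = pairs.find(([from]) => text.startsWith(from, i));
    if (hit) {
      out += hit[1];
      i += hit[0].length;
    } else {
      out += text[i];
      i += 1;
    }
  }
  return out;
}
```

With this subset, `convertVni("Tieáng Vieät")` would turn the copied text into “Tiếng Việt”. Matching the longest sequence first matters because some legacy characters mean one thing alone and another when they follow a base letter.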

PDF.js’ text layer includes a <div> for each run of text; each <div> has a data-font-name attribute that identifies the font used for that run. There must be some way to map that identifier to the original font name, which we can then use to guess an encoding. The names of VNI-encoded fonts always begin with “VNI-”, for instance.
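
A rough sketch of the guessing step, assuming the identifier-to-name mapping already exists somewhere. The `fontNames` map, `guessEncoding`, and `encodingForRun` are all hypothetical names, not PDF.js API; only the “VNI-” prefix rule comes from the description above. The one extra assumption is that embedded subset fonts may carry a six-letter tag such as `ABCDEF+` in front of the real name, which the sketch strips first.

```typescript
type LegacyEncoding = "VNI";

// Guess a legacy encoding from an original font name.
// Only the "VNI-" prefix rule is covered; VPS, ABC, and TCVN3 would
// need their own (unverified here) naming heuristics.
function guessEncoding(fontName: string): LegacyEncoding | null {
  // Subset fonts in PDFs are named like "ABCDEF+VNI-Times"; drop the tag.
  const base = fontName.replace(/^[A-Z]{6}\+/, "");
  return base.startsWith("VNI-") ? "VNI" : null;
}

// Hypothetical glue: `fontId` is the value of a run's data-font-name
// attribute, and `fontNames` is an assumed lookup from that identifier
// to the original font name.
function encodingForRun(
  fontId: string,
  fontNames: ReadonlyMap<string, string>,
): LegacyEncoding | null {
  const name = fontNames.get(fontId);
  return name === undefined ? null : guessEncoding(name);
}
```

Returning `null` for unknown fonts keeps the default behavior safe: text in a font we can’t identify is left untouched rather than converted by mistake.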

1ec5 • Feb 04 '14 11:02