avim
Convert VNI fonts in PDF.js on copy, find, etc.
Vietnamese text in PDFs is usually typeset in non-Unicode fonts that use the VNI, VPS, ABC, or TCVN3 layouts. PDF.js renders this text fine, but the underlying character data is not Unicode, so text that is copied or searched comes out garbled. Because AVIM specializes in Vietnamese input, it’s uniquely suited to detecting legacy-encoded Vietnamese text and converting it on the fly when finding or copying inside a PDF.
PDF.js’ text layer includes a <div> for each run of text; each <div> has a data-font-name attribute that identifies the font used for that run. There must be some way to map that identifier to the original font name, which we can then use to guess an encoding. VNI-encoded fonts always begin with “VNI-”, for instance.
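
As a rough sketch of the last step only, the function below guesses an encoding from an original font name. It assumes the mapping from data-font-name to the original name has already been solved somehow, which is still an open question here. The names guessLegacyEncoding, PREFIX_RULES, and SUBSET_TAG are hypothetical, not part of AVIM or PDF.js. The “VNI-” rule comes from the paragraph above; the “.Vn” rule for TCVN3 is an assumption based on common font names such as “.VnTime” and would need checking against real documents. The subset tag it strips is the six-uppercase-letter prefix that PDF producers add to the names of subsetted embedded fonts.

```javascript
// Hypothetical helper: guess a legacy Vietnamese encoding from a PDF font name.

// Subsetted embedded fonts carry a tag like "ABCDEF+" in front of the real name.
const SUBSET_TAG = /^[A-Z]{6}\+/;

// Prefix rules, checked in order. Matching is case-sensitive.
const PREFIX_RULES = [
  { prefix: "VNI-", encoding: "VNI" },  // stated in the text above
  { prefix: ".Vn", encoding: "TCVN3" }, // assumption, e.g. ".VnTime"
];

function guessLegacyEncoding(fontName) {
  if (typeof fontName !== "string") {
    return null;
  }
  // Drop the subset tag, if any, so the prefix test sees the real font name.
  const baseName = fontName.replace(SUBSET_TAG, "");
  for (const { prefix, encoding } of PREFIX_RULES) {
    if (baseName.startsWith(prefix)) {
      return encoding;
    }
  }
  // Unknown font: treat the text as already Unicode and leave it alone.
  return null;
}

console.log(guessLegacyEncoding("ABCDEF+VNI-Times")); // "VNI"
console.log(guessLegacyEncoding("Helvetica"));        // null
```

Returning null for unrecognized names keeps the conversion opt-in, so text in ordinary Unicode fonts is never rewritten by mistake. VPS and ABC rules are left out because the text gives no naming convention for them.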