tabulapdf
Handle non-latin encodings
This seems really challenging given the quirkiness of the PDF format, but it is the big issue left to implement from the rOpenSci onboarding.
This works on Windows, but is screwy on Linux. Must investigate more.
My understanding of the structure of PDF files is that there is no way to guarantee the correct encoding of text extracted from a PDF in anything other than Unicode. Not only can encodings be defined for each individual text block within a single PDF file, but the file can also contain embedded fonts. Can we just do Encoding(out) <- "UTF-8" and remove the argument?
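For reference, here is a rough base R sketch of what that assignment would and would not do. The sample strings are made up for illustration and are not actual tabulapdf output; the point is only that Encoding<- declares an encoding without converting any bytes, so it is safe only if the extracted text really is UTF-8 already.

```r
# Hypothetical stand-in for extracted text that is already valid UTF-8.
out <- c("caf\u00e9", "plain ascii")
Encoding(out) <- "UTF-8"   # only sets the declared encoding; bytes are unchanged
utf8_ok <- validUTF8(out)  # TRUE for both elements

# Hypothetical counter-case: the same word stored as latin1 bytes.
# Marking it as UTF-8 does not repair it.
latin1_bytes <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
mislabelled <- latin1_bytes
Encoding(mislabelled) <- "UTF-8"
mislabelled_ok <- validUTF8(mislabelled)  # FALSE: still not valid UTF-8

# An actual conversion needs the source encoding to be known.
converted <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")
converted_ok <- validUTF8(converted)      # TRUE after conversion
```

So dropping the argument looks reasonable if the extraction layer always hands back Unicode text, which I believe is the case but have not verified on Linux.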
I don't think I understand PDF well enough to know whether that makes any sense.