tabulapdf

Handle non-latin encodings

Open leeper opened this issue 9 years ago • 3 comments
This seems really challenging given the quirkiness of the PDF format, but it is the big issue left to implement from rOpenSci onboarding.

leeper, May 25 '16 08:05

This works on Windows, but is screwy on Linux. Must investigate more.

leeper, Jun 01 '16 16:06

My understanding of the structure of PDF files is that there is no way one could guarantee the correct encoding of text extracted from a PDF in anything other than Unicode. Not only can encodings be defined for each individual text block within a single PDF file, but the file can even contain embedded fonts. Can we just do Encoding(out) <- "UTF-8" and remove the argument?
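
For illustration, a minimal base R sketch of what that assignment does. It assumes `out` is a character vector whose bytes are already valid UTF-8 but carry no encoding mark; the sample string here is a hypothetical stand-in, not real extractor output. Note that `Encoding<-` only declares how the bytes should be read; it does not convert them, so it would not help if the underlying bytes were in some other encoding.

```r
# Hypothetical stand-in for extracted text: the UTF-8 bytes of "café"
# built from raw values, so R has no declared encoding for the string.
out <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))
before <- Encoding(out)   # no encoding mark yet

# The proposed fix: declare the existing bytes to be UTF-8.
# This changes only the mark, never the bytes themselves.
Encoding(out) <- "UTF-8"
after <- Encoding(out)

# Sanity checks: the bytes form valid UTF-8 and are counted as 4 characters.
ok <- validUTF8(out)
n_chars <- nchar(out, type = "chars")
```

If the bytes were not actually UTF-8, `validUTF8()` would return `FALSE`, which could serve as a cheap guard before marking the output.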

tpaskhalis, Apr 11 '18 18:04

I don't think I understand PDF well enough to know whether that makes any sense.

leeper, Apr 11 '18 23:04