tabulapdf Handle non-latin encodings

Open leeper opened this issue 9 years ago • 3 comments

This seems really challenging given the quirkiness of the PDF format, but it is the big issue left to implement from the rOpenSci onboarding.

May 25 '16 08:05 leeper

This works on Windows, but is screwy on Linux. Must investigate more.

Jun 01 '16 16:06 leeper

My understanding of the structure of PDF files is that there is no way to guarantee the correct encoding of text extracted from a PDF in anything other than Unicode. Not only can encodings be defined for each individual text block within a single PDF file, but the file can even contain embedded fonts. Can we just do Encoding(out) <- "UTF-8" and remove the argument?
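For reference, here is a minimal sketch of what that assignment does in base R. It assumes the extracted text arrives as raw bytes, and the byte vectors and the variable names `latin1`, `marked`, and `converted` are made up for illustration. None of this is tabulapdf's actual code. The point it shows is that `Encoding<-` only declares how the bytes should be read and does not convert them, so it is only safe if the extractor really returns UTF-8.

```r
# Illustrative input: the UTF-8 bytes of a six-letter Cyrillic word,
# standing in for text returned by the extractor (an assumption).
out <- rawToChar(as.raw(c(0xd0, 0x9f, 0xd1, 0x80, 0xd0, 0xb8,
                          0xd0, 0xb2, 0xd0, 0xb5, 0xd1, 0x82)))

# Declare the encoding. The underlying bytes are left untouched.
Encoding(out) <- "UTF-8"
Encoding(out)                # "UTF-8"
nchar(out, type = "chars")   # 6
nchar(out, type = "bytes")   # 12

# If the bytes are not really UTF-8, the declaration is simply wrong.
# Here: the word "cafe" with an accented final e, encoded as Latin-1.
latin1 <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
marked <- latin1
Encoding(marked) <- "UTF-8"
validUTF8(marked)            # FALSE

# An actual conversion needs iconv() and a known source encoding.
converted <- iconv(latin1, from = "latin1", to = "UTF-8")
validUTF8(converted)         # TRUE
```

If the extraction layer always hands back Unicode, the declaration alone would be enough. If it does not, something like the `iconv()` step would still be needed, and that requires knowing the source encoding.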

Apr 11 '18 18:04 tpaskhalis

I don't think I understand PDF well enough to know whether that makes any sense.

Apr 11 '18 23:04 leeper
