tabula-java

Arabic Letters are ???

Open yovrer opened this issue 5 years ago • 4 comments

Hello, I am trying to extract a table from a PDF that contains Arabic letters, but in the extracted table every Arabic letter comes out as ???. This happens only when I use the jar; when I use the web app I get the data without any issue.

How can I solve this issue?

yovrer avatar Aug 21 '20 10:08 yovrer

@yovrer Open your document in Acrobat Reader and then press Command + D on OSX (or Control + D on Windows, I believe). This should bring up the Document Properties dialog. Under the Fonts tab, do you see something like:

Type: TrueType (CID)
Encoding: Identity-H

rayleeriver avatar Aug 21 '20 19:08 rayleeriver

Thanks @rayleeriver for replying. Yes, I see what you described in the Fonts tab [image]. What does that mean?

yovrer avatar Aug 23 '20 03:08 yovrer

CID/Identity-H fonts make the text impossible to parse. See Adobe's own answer: https://community.adobe.com/t5/acrobat/font-encoding-settings-removing-identity-h-encoding/td-p/10605220?page=1

I also tried Acrobat Pro DC's preflight trick, with no success. We were lucky enough to be able to change the "Font" selection in our vendor's tool to one that is NOT a CID font. After that, we were able to extract table data with Tabula.

rayleeriver avatar Aug 23 '20 21:08 rayleeriver

But why can the web app extract the Arabic words from the PDF without any issue? The issue happens only when I use the jar!

yovrer avatar Aug 23 '20 22:08 yovrer
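
One possible explanation for the jar/web app difference, which nobody in this thread has confirmed, is output encoding rather than the font: if the JVM writes the extracted text in a charset that cannot represent Arabic, every such character is replaced with `?`. The sketch below uses only the standard Java library (not tabula-java) to show that effect. The class name `CharsetDemo` and the helper `roundTrip` are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Hypothetical helper: encode text in the given charset, then decode it
    // back, which is what happens when output is written and read in that charset.
    static String roundTrip(String text, Charset charset) {
        // getBytes replaces characters the charset cannot map with its
        // replacement byte, which is '?' for US-ASCII.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String arabic = "\u0645\u0631\u062D\u0628\u0627"; // five Arabic letters
        // The charset the JVM uses by default on this machine.
        System.out.println("default charset: " + Charset.defaultCharset());
        // A charset without Arabic loses every letter.
        System.out.println(roundTrip(arabic, StandardCharsets.US_ASCII)); // ?????
        // UTF-8 keeps the text intact.
        System.out.println(roundTrip(arabic, StandardCharsets.UTF_8).equals(arabic)); // true
    }
}
```

If the first line printed on your machine is not UTF-8, it may be worth starting the JVM with the standard system property `-Dfile.encoding=UTF-8` before `-jar` and comparing the output. This is an assumption to test, not a confirmed fix for this issue.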