tabula-java

Gibberish in output

Open · kevinburke opened this issue 3 years ago · 2 comments

I'm using Tabula for Mac. We are trying to export the tables in the attached PDF (concord_housing_table.pdf).

The initial upload generated a lot of overlapping selections. We removed all of them except for the selections that covered the entire table row.

When we go to export, the output looks like complete gibberish:

[Screenshot: Export Data | Tabula 2022-08-15 11-07-17]

We're confused about this, because clearly it's meaningful gibberish - the number of gibberish characters corresponds to the text in the original file. Maybe we missed an encoding setting? We tried using the tools in the app but didn't see anything meaningful.

kevinburke, Aug 15 '22 18:08

Hi @kevinburke nice to see you here :)

This is almost certainly an issue in how pdfbox, the library Tabula uses to interact with the PDF at a low level, handles PDFs generated in weird ways. The best fix is to re-encode the PDF with pdftk, Acrobat, or a tool of your choice. That generally fixes things.

jeremybmerrill, Aug 15 '22 18:08
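
For anyone landing here later, a rough sketch of the pdftk re-encode step, wrapped in Python so it is easy to script. The file names and the `reencode_command` / `reencode` helpers are made up for illustration; the only real tool involved is pdftk, called in its basic `pdftk <input> output <output>` form. Whether this actually clears up the gibberish depends on the PDF.

```python
import shutil
import subprocess


def reencode_command(src, dst):
    """Build the argv for a plain pdftk rewrite of src into dst (hypothetical helper)."""
    return ["pdftk", src, "output", dst]


def reencode(src, dst):
    """Run pdftk if it is installed; return True when the rewrite was attempted and succeeded."""
    if shutil.which("pdftk") is None:
        # pdftk is not on PATH, so there is nothing to run.
        return False
    result = subprocess.run(reencode_command(src, dst))
    return result.returncode == 0


# Example file names are assumptions, not taken from the issue.
command = reencode_command("concord_housing_table.pdf", "concord_housing_table_reencoded.pdf")
```

After that, the re-encoded file would be the one to load into Tabula.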

It could also be a subsetted font, which is essentially a non-standard encoding. See this StackOverflow answer.

jazzido, Aug 15 '22 18:08
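
To illustrate why a subsetted font can produce "meaningful gibberish" with the same character count as the original, here is a toy model in Python. It is not how pdfbox or Tabula work internally; the numbering scheme and every function name below are assumptions made up for the sketch. The idea is that the PDF stores glyph codes private to the embedded font, and without a mapping back to Unicode an extractor can only guess what each code means.

```python
def build_subset_font(text):
    """Assign codes 1, 2, 3... to characters in order of first appearance (toy subsetter)."""
    code_for_char = {}
    for ch in text:
        if ch not in code_for_char:
            code_for_char[ch] = len(code_for_char) + 1
    return code_for_char


def encode(text, code_for_char):
    """Turn text into the list of glyph codes the toy PDF would store."""
    return [code_for_char[ch] for ch in text]


def naive_extract(codes):
    """Guess at the text with no code-to-Unicode map: treat each code as an ASCII offset."""
    return "".join(chr(0x20 + code) for code in codes)


def extract_with_map(codes, char_for_code):
    """Recover the text when a code-to-Unicode map is available."""
    return "".join(char_for_code[code] for code in codes)


original = "Concord housing"
font = build_subset_font(original)
codes = encode(original, font)
char_for_code = {code: ch for ch, code in font.items()}

garbled = naive_extract(codes)                      # one wrong character per original character
recovered = extract_with_map(codes, char_for_code)  # the original text again
```

The garbled string has exactly one character per glyph, which matches what was reported above: the length lines up with the source text even though no character is right.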