
A sentence from columns

Open vanushkin opened this issue 3 years ago • 4 comments

Dear developers, I'm having the following issue: when processing PDFs whose text is laid out in columns, the extracted sentences are stitched together from lines of different columns, which makes a mess of the text. Is there a solution to this problem, or a hint on how to retain the structure of the original text?

vanushkin avatar Oct 08 '21 15:10 vanushkin

@vanushkin please take a look at the tabulizer R package, which deals with this.

MarcinKosinski avatar Jul 07 '22 11:07 MarcinKosinski

@MarcinKosinski I would love to try this solution, but tabulizer has been removed from CRAN, and it has a Java jar dependency whose execution is blocked by default on the computers in my office. There is no chance of getting the sysadmins to unblock it. When I export a well-formed PDF "as txt" from Adobe Acrobat, the text flow is respected even though there are 2 columns. There must be something in the PDF's inner markup that identifies the text flow. Couldn't pdftools get the text flow from that information?

aourednik avatar Jul 03 '23 15:07 aourednik

Actually, this is not stored in the PDF's inner markup; see https://ropensci.org/blog/2018/12/14/pdftools-20 for details. I think tabulizer tries to guess the layout of columns and tables based on whitespace.
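To illustrate what guessing the layout from positions could look like without a Java dependency, here is a minimal sketch in R. It assumes word boxes in the shape that `pdftools::pdf_data()` returns per page (columns `x`, `y`, `text`). The helper `reorder_columns()` and its parameters `split_x` and `y_tol` are hypothetical names made up for this example, not part of pdftools, and the sketch only handles a plain two-column page whose gutter position you already know.

```r
# Hypothetical helper (not part of pdftools): read a two-column page
# column by column instead of straight across.
# `words`   : data frame with columns x, y, text (one row per word)
# `split_x` : assumed x position of the gutter between the two columns
# `y_tol`   : assumed tolerance for treating nearby y values as one line
reorder_columns <- function(words, split_x, y_tol = 2) {
  # Assign each word to the left (1) or right (2) column.
  column <- ifelse(words$x < split_x, 1L, 2L)
  # Bucket words into lines by rounding y to the tolerance.
  line <- round(words$y / y_tol)
  # Sort by column first, then top to bottom, then left to right.
  ord <- order(column, line, words$x)
  sorted <- words[ord, ]
  key <- paste(column[ord], line[ord])
  # Collapse the words of each line into one string, keeping sorted order.
  lines <- tapply(sorted$text, factor(key, levels = unique(key)),
                  paste, collapse = " ")
  unname(as.vector(lines))
}

# Toy data standing in for one page of word boxes.
words <- data.frame(
  x = c(50, 300, 90, 340, 50, 300),
  y = c(100, 100, 100, 100, 112, 112),
  text = c("Left", "Right", "one", "one", "two", "two"),
  stringsAsFactors = FALSE
)
result <- reorder_columns(words, split_x = 200)
print(result)
```

On a real file, the input would come from something like `pdftools::pdf_data("file.pdf")[[1]]`, and `split_x` would have to be chosen by inspecting the `x` values for a gap. A layout with more than two frames would need a more general grouping step than this single split.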

jeroen avatar Jul 03 '23 16:07 jeroen

@jeroen I've tried with a PDF file generated by Illustrator (see attached file). Despite the layout's relative complexity, Acrobat recognizes the order of the frames I've defined. This flow order must be stored somewhere, otherwise this would not be possible. Acrobat cannot just guess this on the fly.

Perhaps some inner markup elements specific to Acrobat products?


test-text-flow.pdf

aourednik avatar Jul 03 '23 19:07 aourednik