pdftools
A sentence from columns
Dear developers, I'm having the following issue: when processing PDFs whose text is laid out in columns, I get sentences assembled from lines that belong to different columns. It makes a mess of the text. Is there any solution to this problem, or a hint on how I can retain the structure of the original text?
@vanushkin please look at tabulizer, an R package that deals with it.
@MarcinKosinski I would love to try this solution, but tabulizer has been removed from CRAN and it has a java jar dependency whose execution is blocked by default on the computers in my office. No chance to have the sysadmins unblock it. When I export a well-formed pdf "as txt" from Adobe Acrobat, the text-flow is respected despite there being 2 columns. There must be something in the PDF inner markup that identifies the text flow. Couldn't pdftools get the text flow from that information?
Actually, this is not stored in the PDF's inner markup; see https://ropensci.org/blog/2018/12/14/pdftools-20. I think tabulizer tries to guess the layout of columns and tables based on whitespace.
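To make the whitespace idea concrete, here is a minimal sketch of one way to guess columns from word positions. It is not tabulizer's actual algorithm, and it is written in Python only to keep it self-contained. The word records, with fields `x`, `y`, `width` and `text`, are assumed to resemble the per-word bounding boxes that `pdf_data()` returns; the function names and the `min_gap` threshold are hypothetical.

```python
from itertools import groupby


def split_columns(words, min_gap=20):
    """Group words into columns separated by wide vertical gutters.

    words: list of dicts with keys "x", "y", "width", "text" (assumed shape).
    min_gap: smallest horizontal gap, in page units, treated as a gutter
             (hypothetical default; it has to be tuned per document).
    """
    if not words:
        return []
    # Horizontal extent of every word, sorted left to right.
    intervals = sorted((w["x"], w["x"] + w["width"]) for w in words)
    # Merge extents that overlap or sit closer than min_gap: each merged
    # band is one column, each remaining gap is a gutter.
    bands = [list(intervals[0])]
    for left, right in intervals[1:]:
        if left - bands[-1][1] < min_gap:
            bands[-1][1] = max(bands[-1][1], right)
        else:
            bands.append([left, right])
    # Assign each word to the band that contains its left edge.
    columns = [[] for _ in bands]
    for w in words:
        for i, (left, right) in enumerate(bands):
            if left <= w["x"] <= right:
                columns[i].append(w)
                break
    return columns


def read_in_order(words, min_gap=20):
    """Return text lines column by column, top to bottom within each column."""
    lines = []
    for column in split_columns(words, min_gap):
        column.sort(key=lambda w: (w["y"], w["x"]))
        # Words sharing the same y are treated as one line (a simplification;
        # real pages may need a tolerance on y).
        for _, line_words in groupby(column, key=lambda w: w["y"]):
            lines.append(" ".join(w["text"] for w in line_words))
    return lines
```

Calling `read_in_order(words)` on the words of a single page yields the lines of the leftmost column first, then those of the next column. The sketch only handles layouts where the gutter runs the full height of the page: a headline spanning both columns would bridge the gap and merge everything into one band.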
@jeroen I've tried with a PDF file generated by Illustrator (see attached file). Despite the layout's relative complexity, Acrobat recognizes the order of the frames I've defined. This flow order must be stored somewhere, otherwise this would not be possible. Acrobat cannot just guess this on the fly.
Perhaps some inner markup elements specific to Acrobat products?