`pdPageExtractText` should support multi-column documents

Open sambitdash opened this issue 7 years ago • 4 comments

This implementation may need to be reviewed along with #2. Although the two may not overlap exactly, in some cases the implementation logic can be similar.

sambitdash, Nov 14 '17 11:11

Is there any way to currently do this?

Nosferican, Nov 09 '20 22:11

Not really. You can manually evaluate every text run and check whether the runs form a column. The PDF specification does not provide any structural hints for this.
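
As an illustration of what "check whether the runs form a column" could look like, here is a minimal sketch of one possible heuristic. It is written in Python rather than Julia to keep it library-agnostic, and it is not part of PDFIO.jl: `TextRun`, `split_columns`, `reading_order_text`, and the `min_gutter` threshold are all hypothetical names, and the bounding boxes are assumed to have been obtained separately (for example from the content tree). It merges the horizontal extents of the runs into bands, treats any horizontal gap of at least `min_gutter` units as a column gutter, and reads each column top to bottom.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """Hypothetical text run: its text plus a bounding box in PDF units."""
    text: str
    x0: float  # left
    y0: float  # bottom (PDF origin is at the bottom-left of the page)
    x1: float  # right
    y1: float  # top


def split_columns(runs, min_gutter=20.0):
    """Group runs into columns separated by empty vertical gutters.

    min_gutter is an assumed threshold; a real document may need tuning.
    """
    if not runs:
        return []
    # Merge the x-extents of all runs into horizontal bands. A new band
    # starts whenever the gap to the previous band is at least min_gutter.
    intervals = sorted((r.x0, r.x1) for r in runs)
    bands = [list(intervals[0])]
    for lo, hi in intervals[1:]:
        if lo - bands[-1][1] < min_gutter:
            bands[-1][1] = max(bands[-1][1], hi)
        else:
            bands.append([lo, hi])
    # Assign each run to the band that contains its horizontal midpoint.
    columns = [[] for _ in bands]
    for r in runs:
        mid = (r.x0 + r.x1) / 2
        for i, (lo, hi) in enumerate(bands):
            if lo <= mid <= hi:
                columns[i].append(r)
                break
    # Within a column, read top to bottom (larger y first), then left to right.
    for col in columns:
        col.sort(key=lambda r: (-r.y0, r.x0))
    return columns


def reading_order_text(runs, min_gutter=20.0):
    """Concatenate the columns left to right, one run per line."""
    cols = split_columns(runs, min_gutter)
    return "\n".join(r.text for col in cols for r in col)


if __name__ == "__main__":
    page = [
        TextRun("L1", 50, 700, 250, 712),
        TextRun("R1", 300, 700, 500, 712),
        TextRun("L2", 50, 680, 240, 692),
        TextRun("R2", 300, 680, 500, 692),
    ]
    print(reading_order_text(page))
```

Note the limitation of this approach: a single run that spans the full page width, such as a title, bridges the gutter and collapses everything into one band, so such runs would have to be filtered out or handled separately.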

sambitdash, Nov 11 '20 06:11

On a related note, since by the nature of the format the output of `pdPageExtractText` is not fully determined, it would be useful to:

  1. Have access to character-level information (font, bounding box, and so on).
  2. Document what the word inference and ordering heuristics are.

vargonis, Nov 18 '22 11:11

@vargonis you can use `pdPageEvalContent` to get the content tree. The content tree has all the bounding-box information at the text-run level.

sambitdash, Nov 18 '22 11:11