PDFIO.jl `pdPageExtractText` should support multi-column documents

Open sambitdash opened this issue 7 years ago • 4 comments

This implementation may need to be reviewed along with #2. Although there may not be an exact overlap, in some cases the implementation logic can be similar.

Nov 14 '17 11:11 sambitdash

Is there any way to currently do this?

Nov 09 '20 22:11 Nosferican

Not really. You can manually examine the position of every text run and check whether the runs form a column. The PDF specification does not provide any structural hints for this.
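
A minimal sketch of that manual approach, in plain Julia. It assumes you have already collected each text run's horizontal extent, baseline and text into a record. The `TextRun` struct, the `group_columns` and `reading_order` functions and the `gap` threshold are all hypothetical names for illustration, not part of PDFIO.jl.

```julia
# Hypothetical text-run record (not a PDFIO.jl type):
# left/right x-coordinates, baseline y, and the run's text.
struct TextRun
    xmin::Float64
    xmax::Float64
    y::Float64
    text::String
end

# Group runs into columns: sort by left edge and start a new column whenever
# a run begins more than `gap` units to the right of everything seen so far.
function group_columns(runs::Vector{TextRun}; gap::Float64 = 20.0)
    columns = Vector{Vector{TextRun}}()
    right_edge = -Inf
    for run in sort(runs; by = r -> r.xmin)
        if isempty(columns) || run.xmin > right_edge + gap
            push!(columns, TextRun[])
            right_edge = run.xmax
        else
            right_edge = max(right_edge, run.xmax)
        end
        push!(columns[end], run)
    end
    return columns
end

# Read columns left to right, and each column top to bottom.
# PDF user space has y increasing upwards, so larger y comes first.
function reading_order(runs::Vector{TextRun}; gap::Float64 = 20.0)
    lines = String[]
    for col in group_columns(runs; gap = gap)
        for run in sort(col; by = r -> -r.y)
            push!(lines, run.text)
        end
    end
    return lines
end

runs = [
    TextRun(50, 280, 700, "left 1"),
    TextRun(320, 550, 700, "right 1"),
    TextRun(50, 280, 680, "left 2"),
    TextRun(320, 550, 680, "right 2"),
]
reading_order(runs)  # ["left 1", "left 2", "right 1", "right 2"]
```

This is only a heuristic: a full-width heading or a table that spans both columns bridges the gap and merges everything into a single column, so real documents would need extra handling.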

Nov 11 '20 06:11 sambitdash

On a related note, since by the nature of the format the output of `pdPageExtractText` is not fully determined, it would be useful to:

- Have access to character-level information (font, bounding box and so on).
- Document what the word inference and ordering heuristics are.

Nov 18 '22 11:11 vargonis

@vargonis you can use `pdPageEvalContent` to get the content tree. The content tree has all the bounding box information at the text-run level.

Nov 18 '22 11:11 sambitdash
