PDFIO.jl
`pdPageExtractText` should support multi-column documents
This may need to be reviewed together with #2. Although the two may not overlap exactly, the implementation logic is likely to be similar in some cases.
Is there any way to currently do this?
Not really. You can manually estimate the position of every text run and check whether the runs form a column. The PDF specification does not provide any structural hints for this.
On a related note, since the nature of the format means the output of `pdPageExtractText` is not fully determined, it would be useful to:
- Have access to character level information (font, bounding box and so on).
- Document what the word inference and ordering heuristics are.
@vargonis you can use `pdPageEvalContent` to get the content tree. The content tree has all the bounding box information at the text run level.
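
For anyone trying the manual route, here is a minimal sketch of how column grouping over run-level bounding boxes might look. It does not call PDFIO at all: the `TextRun` struct, the `group_columns` and `reading_order` functions, and the `gap` threshold are hypothetical names made up for illustration, and you would have to fill the runs in yourself from whatever the content tree gives you. It also assumes coordinates in PDF user space with y increasing upward.

```julia
# Hypothetical text-run record. This is NOT a PDFIO type; populate it
# from the bounding boxes you obtain for each text run.
struct TextRun
    text::String
    x0::Float64   # left edge
    y0::Float64   # bottom edge
    x1::Float64   # right edge
    y1::Float64   # top edge
end

# Group runs into columns: sort by left edge, then start a new column
# whenever a run begins more than `gap` units to the right of everything
# collected so far. `gap` is an assumed tuning parameter.
function group_columns(runs::Vector{TextRun}; gap::Float64=20.0)
    columns = Vector{Vector{TextRun}}()
    isempty(runs) && return columns
    sorted = sort(runs; by = r -> r.x0)
    current = [sorted[1]]
    right = sorted[1].x1          # rightmost edge of the current column
    for r in sorted[2:end]
        if r.x0 - right > gap     # horizontal whitespace: new column
            push!(columns, current)
            current = [r]
            right = r.x1
        else
            push!(current, r)
            right = max(right, r.x1)
        end
    end
    push!(columns, current)
    return columns
end

# Reading order: columns left to right, and top to bottom inside each
# column (descending top edge, since y grows upward).
function reading_order(runs::Vector{TextRun}; gap::Float64=20.0)
    ordered = String[]
    for col in group_columns(runs; gap=gap)
        for r in sort(col; by = r -> r.y1, rev = true)
            push!(ordered, r.text)
        end
    end
    return ordered
end

# Made-up runs from a two-column page, deliberately out of order.
runs = [
    TextRun("right 1", 320.0, 700.0, 540.0, 712.0),
    TextRun("left 2", 72.0, 680.0, 290.0, 692.0),
    TextRun("left 1", 72.0, 700.0, 290.0, 712.0),
    TextRun("right 2", 320.0, 680.0, 540.0, 692.0),
]
println(join(reading_order(runs), "\n"))
```

With the sample data this prints `left 1`, `left 2`, `right 1`, `right 2`, one per line. Note that this is only a heuristic: a heading that spans the full page width would bridge the gap and merge both columns into one, so a real implementation would likely need to handle such runs separately.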