
Does pdf-reader support tagged PDFs?

Open Noctambul opened this issue 5 years ago • 3 comments

Hi,

I'm working with some tagged PDFs and I need to extract tables (arrays) from them. These tables are tagged, and I think the tags are the only way to parse them properly: the rows have different cell sizes, and the tables can be on different pages.

So I'm wondering whether the pdf-reader API is able to handle tagged PDFs?

Thank you for your attention.

Noctambul, Jan 09 '20 16:01

Yes, and this could also help with accessible PDFs.

MonsieurDart, Jan 10 '20 17:01

I believe pdf-reader will provide access to the tagged data, but it's pretty low level. For example, the high-ish level Page#text method ignores tags, but the low-level Page#walk_contents method should generate callbacks for tags.

Unfortunately I haven't worked with tagged PDFs myself, so I'm not super familiar with how to extract the data.
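
As a rough starting point, here is a minimal sketch of a receiver that groups text by marked-content sequence. It is not tested against real tagged files and rests on a few assumptions: that the receiver is handed to `Page#walk`, and that pdf-reader reports the BMC, BDC and EMC operators as `begin_marked_content`, `begin_marked_content_with_pl` and `end_marked_content`. `MarkedContentReceiver` and `tagged.pdf` are made-up names for illustration.

```ruby
# Hypothetical receiver: collects the raw text shown inside each
# marked-content sequence, together with its tag and MCID (if any).
class MarkedContentReceiver
  attr_reader :chunks

  def initialize
    @stack = []   # marked-content sequences that are currently open
    @chunks = []  # finished sequences, in the order they were closed
  end

  # BMC operator: a tag without a property list.
  def begin_marked_content(tag)
    @stack.push(tag: tag, mcid: nil, text: String.new)
  end

  # BDC operator: a tag plus a property list. The list may be an inline
  # dictionary (a Hash) or just the name of an entry in the page resources;
  # only the inline case is handled here.
  def begin_marked_content_with_pl(tag, properties)
    mcid = properties.is_a?(Hash) ? properties[:MCID] : nil
    @stack.push(tag: tag, mcid: mcid, text: String.new)
  end

  # EMC operator: closes the innermost open sequence.
  def end_marked_content
    chunk = @stack.pop
    @chunks << chunk if chunk
  end

  # Tj operator. The operand is the raw string from the content stream;
  # it is NOT decoded through the font here, so it may not be readable text.
  def show_text(string)
    @stack.last[:text] << string.b if @stack.any?
  end

  # TJ operator: an array mixing strings and kerning adjustments.
  def show_text_with_positioning(params)
    params.each { |item| show_text(item) if item.is_a?(String) }
  end
end

# Usage with pdf-reader (assumed API, file name is a placeholder):
#   require "pdf/reader"
#   reader = PDF::Reader.new("tagged.pdf")
#   reader.pages.each do |page|
#     receiver = MarkedContentReceiver.new
#     page.walk(receiver)
#     receiver.chunks.each { |chunk| p chunk }
#   end
```

Text shown outside any marked-content sequence is dropped, and nested sequences only record text in the innermost one. Mapping each MCID back to a table row or cell would still require reading the document's structure tree, which this sketch does not attempt.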

yob, Jan 13 '20 12:01

Thank you for your answer and for the details. We will look into your suggestion carefully :)

Noctambul, Jan 15 '20 16:01