Accessibility tagging
Hi there,
Was wondering if, when the dev is particularly bored, would you mind considering implementing extraction of accessibility tagging?
Thank you
Hi @NathanTech7713 — thanks for your interest in this library, and for this suggestion. For my own notes and for others who may be less familiar:
- "Tagged PDFs" are documents that use the PDF spec's features for enabling accessibility, by adding semantic markup in the form of "tags": https://www.pdfa.org/wp-content/uploads/2019/06/TaggedPDFBestPracticeGuideSyntax.pdf
- pdfminer.six, the parser on which pdfplumber depends, does seem to have some functionality for identifying and extracting these tags. However, that functionality comes from the TagExtractor class, which is a subclass of the PDFDevice class we use. Still, it's helpful to know that there's some functionality there, even if we'd have to patch it in.
And some general questions: What should the output of this extraction look like? A nested tree of tags? Something else?
@NathanTech7713: Do you have any examples of other PDF extraction libraries that have a feature like this, and which you think would provide a useful model?
Hi! I was about to make this same feature request. I've done a bit of exploration here as I am working on extracting the structure from PDFs and, obviously, it makes sense to use explicit structure if it's there... well, sort of.
Most of the libraries that support tagged PDF are closed-source, but some functionality to extract it exists in Poppler and pdf.js, and you can see the tags by running pdfinfo -struct on a PDF (or pdfinfo -struct-text to see the content of the tags as well). Unfortunately the generation of structure and tags is, to put it mildly, highly variable across different PDF authoring tools, and I haven't come remotely close to understanding the (very convoluted) specification. The W3C has a nice overview of logical structure and tagged PDF here: https://www.w3.org/TR/2014/NOTE-WCAG20-TECHS-20140408/pdf_notes.html
Basically there are a couple of moving parts, which you can find starting in section 10.5 of the PDF 1.7 spec (or maybe section 14, if you have the Adobe/ISO document?):
- Marked content sections - this is what pdfminer.six will give you if you use TagExtractor, which I think we can agree is a sub-optimal API (I am not really sure how it could be integrated in pdfplumber). These are the sections of text/objects/whatever in the PDF that correspond to structural units. Sometimes they have meaningful tags attached directly to them (notably, LibreOffice will do this) but usually they are all tagged as "P" and you have to look in the "logical structure" to get more useful information.
- Logical structure - this is how you get a table of contents in the sidebar in your PDF reader, and it is pretty well supported by open-source libraries like Poppler and pdf.js, though usually with a torrent of error messages since there is so much variability in the way PDF creation tools implement it (probably because the spec is difficult to understand and full of options). The Poppler implementation is unreadable since it's written in C++ ;-) so look at the pdf.js implementation instead. You can get at this from the StructTreeRoot, RoleMap, ParentTree and sometimes ClassMap entries in the document catalog. It's a horrible, cyclical (notably pdfminer.six will crash with a stack overflow trying to resolve it) mess of PDFObject references. At some point (and there are multiple ways this can happen) you will end up at a leaf node which gives you an MCID that you can use to refer back to the marked content sections noted above. But they might be indirected through the ParentTree because Reasons.
- Tagged PDF - this defines a whole bunch of extra standards on top of the two previous things, along with a (supposedly) standardized set of structural tags, a vaguely HTML+CSS-like layout model, and some extra attributes to help distinguish main content from headers, footers, etc, and also (yes!) actually define the "words" in the document which we know PDF doesn't do by default.
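To make the relationship between the logical structure and the marked content sections a bit more concrete, here is a minimal sketch of walking a structure tree down to its MCID leaves. It does not use pdfminer.six at all: the tree is modeled with plain Python dicts whose keys (`S` for the structure type, `K` for the kids, `MCID` on a marked-content reference) mirror the names in the PDF spec, and `collect_mcids`, the `Title1` role and the sample data are made up for illustration. A set of visited nodes guards against the kind of cyclical references described above.

```python
from typing import Any, Dict, Iterator, Optional, Set, Tuple


def collect_mcids(
    node: Dict[str, Any],
    role_map: Optional[Dict[str, str]] = None,
    _seen: Optional[Set[int]] = None,
) -> Iterator[Tuple[str, int]]:
    """Yield (tag, mcid) pairs for every marked-content leaf under `node`."""
    role_map = role_map or {}
    seen = _seen if _seen is not None else set()
    if id(node) in seen:
        # Already visited: stop here instead of recursing forever.
        return
    seen.add(id(node))

    # Map a producer-specific tag to a standard one if a mapping exists.
    raw_tag = node.get("S", "")
    tag = role_map.get(raw_tag, raw_tag)

    # "K" may hold a single kid or a list of kids.
    kids = node.get("K", [])
    if not isinstance(kids, list):
        kids = [kids]

    for kid in kids:
        if isinstance(kid, dict) and "MCID" in kid:
            # Leaf given as a marked-content reference dictionary.
            yield tag, kid["MCID"]
        elif isinstance(kid, int):
            # Leaf given directly as an integer MCID.
            yield tag, kid
        elif isinstance(kid, dict):
            # Another structure element: recurse, sharing the visited set.
            yield from collect_mcids(kid, role_map, seen)


# Toy data (assumed, not taken from a real PDF).
para = {"S": "P", "K": [0, {"Type": "MCR", "MCID": 1}]}
heading = {"S": "Title1", "K": 2}
root = {"S": "Document", "K": [heading, para]}
para["K"].append(root)  # deliberately introduce a cycle

pairs = list(collect_mcids(root, role_map={"Title1": "H1"}))
print(pairs)
```

A real implementation would also have to resolve indirect object references and the ParentTree indirection, which this sketch leaves out entirely.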
See https://github.com/dhdaines/alexi/blob/main/scripts/pdfstructure.py for a quick-and-dirty script (based on pdfminer.six code) which prints MCID sections and tags and attempts (but doesn't really succeed) to resolve the structure tree, and https://github.com/dhdaines/alexi/blob/main/test/data/pdf_structure.pdf for a test document with structure and tags.
What I would find minimally useful (but I can't speak for the original author of this issue) would be:
- A method to extract marked content sections and their attributes in a page, akin to extract_words, and some way to place words from extract_words within a given content section (yes, this could just be done with the bounding box)
- A method to extract (a simplified version of) the structure tree from the document such that one could easily get to the marked content sections from it and vice versa.
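The bounding-box approach mentioned in the first item could look roughly like the sketch below. The word dicts use the `x0`, `top`, `x1`, `bottom` and `text` keys that `extract_words` returns; the section dicts (with `mcid`, `tag` and the same four coordinates), the tolerance value and the function `assign_words_to_sections` are hypothetical and not part of pdfplumber.

```python
from typing import Any, Dict, List

Box = Dict[str, Any]


def contains(outer: Box, inner: Box, tolerance: float = 0.5) -> bool:
    """True if `inner` lies within `outer`, allowing a small tolerance."""
    return (
        inner["x0"] >= outer["x0"] - tolerance
        and inner["x1"] <= outer["x1"] + tolerance
        and inner["top"] >= outer["top"] - tolerance
        and inner["bottom"] <= outer["bottom"] + tolerance
    )


def assign_words_to_sections(
    words: List[Box], sections: List[Box]
) -> Dict[int, List[str]]:
    """Group word texts by the MCID of the first section containing them."""
    by_mcid: Dict[int, List[str]] = {s["mcid"]: [] for s in sections}
    for word in words:
        for section in sections:
            if contains(section, word):
                by_mcid[section["mcid"]].append(word["text"])
                break  # a word belongs to at most one section here
    return by_mcid


# Assumed sample data: one heading section and one paragraph section.
sections = [
    {"mcid": 0, "tag": "H1", "x0": 50, "top": 50, "x1": 300, "bottom": 70},
    {"mcid": 1, "tag": "P", "x0": 50, "top": 80, "x1": 300, "bottom": 120},
]
words = [
    {"text": "Annual", "x0": 50, "top": 52, "x1": 95, "bottom": 68},
    {"text": "Report", "x0": 100, "top": 52, "x1": 150, "bottom": 68},
    {"text": "Revenue", "x0": 50, "top": 82, "x1": 110, "bottom": 96},
]

placed = assign_words_to_sections(words, sections)
print(placed)
```

Words that fall outside every section are simply dropped in this sketch; whether they should be kept under a separate key is a design question left open here.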
Woops! Got to be honest, thought I replied and then didn't!
@dhdaines sums it up quite well in what I am also hoping for.
I think I mentioned quite a while ago that I eventually want to put together an accessible PDF reader for screen reader (totally blind) users of Windows, so accessibility tagging would be a solid way of identifying structure.
Thank you both, these are very helpful notes/context. I can't promise I'll get to this soon, but it does seem worth trying to add.
If it helps I can make a preliminary PR with something like what I mentioned above (extraction of marked content sections + structure tree parsing)
@dhdaines Thanks for the offer! Is there a particular subset of this functionality that would be easiest to start trying to integrate into pdfplumber? (I.e., require the least modification of existing code / least performance impact.)
At first glance - extracting the structure tree is relatively easy and can be done on-demand as it's all in the document catalog - linking it to the MCIDs might have more of a performance impact, at least with pdfminer.six, since it seems like we have to decode and parse the entire document to get them, even for a single page, but I could be mistaken about this!
Thanks! That sounds like a reasonable place to start. I suppose we could expose that similarly to how we do with Page.annots — i.e., outside the main parsing function?
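One way to read "outside the main parsing function" is a lazily computed property: the structure tree is only built when someone asks for it, so ordinary page parsing pays nothing. A minimal sketch of that idea follows, using a stand-in `Page` class and a hypothetical `_load_structure_tree` helper rather than pdfplumber's actual internals.

```python
from functools import cached_property
from typing import Any, Dict, List


class Page:
    """Stand-in for a page object; not pdfplumber's real Page class."""

    def __init__(self, page_number: int) -> None:
        self.page_number = page_number
        self.tree_loads = 0  # counts how often the expensive step runs

    def _load_structure_tree(self) -> List[Dict[str, Any]]:
        # Hypothetical helper: a real implementation would read the
        # document catalog here. Fixed data is returned for illustration.
        self.tree_loads += 1
        return [{"type": "P", "mcids": [0, 1]}]

    @cached_property
    def structure_tree(self) -> List[Dict[str, Any]]:
        # Computed on first access only, then cached on the instance.
        return self._load_structure_tree()


page = Page(1)
loads_before = page.tree_loads  # nothing has been loaded yet
first = page.structure_tree     # triggers the load
second = page.structure_tree    # served from the cache
print(loads_before, page.tree_loads, first is second)
```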
The pypdfium2 interface to the underlying pdfium API may be useful for this: https://pdfium.googlesource.com/pdfium/+/refs/heads/main/public/fpdf_structtree.h
Actually this is quite easy. I should have a PR for you tonight or tomorrow, I hope.
Ready for review, see PR above. I'll test it more on my PDFs of interest, but it is functional and somewhat documented, see docs/structure.md and tests/test_structure.py for examples.
Many thanks, @dhdaines, and a particular thanks for the documentation. It might take me a little while to review the PR, due to other workload and me being relatively new to the topic/feature, but on first glance, it seems like a helpful contribution.
Now that #961 and #963 are merged, is this issue all clear to close? Or are there other features that would need to be in place for us to say we've handled accessibility tagging?
Thanks! There is at least one small add-on to consider - #961 doesn't give access to the tag attributes, only the tag name. These allow you to distinguish between different types of artifacts (header, footer, etc).
I'm not sure if we want to add them as a dictionary-valued attribute for each object in a marked content section, as this could produce large outputs (it shouldn't be a huge problem for memory consumption since it's the same dictionary...)
"Tagged PDF" is a fairly vaguely defined standard (or perhaps I just don't fully understand it yet) so there may be other things too.
Thanks, @dhdaines. A couple of follow-up questions:
> I'm not sure if we want to add them as a dictionary-valued attribute for each object in a marked content section
Could you share an example of what this would look like?
> as this could produce large outputs
I agree with the general inclination here. Could we have it both ways and allow users to opt-in to this additional output?
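To make the two questions above easier to discuss, here is a rough sketch of what a dictionary-valued attribute with an opt-in switch might look like. The function `tag_objects`, its `include_attrs` parameter, the `attrs` key and the sample data are all assumptions made up for this example and do not describe pdfplumber's current behavior. The `Type`/`Subtype` pair is the kind of attribute that distinguishes a header artifact from other artifacts.

```python
from typing import Any, Dict, List


def tag_objects(
    chars: List[Dict[str, Any]],
    section: Dict[str, Any],
    include_attrs: bool = False,
) -> List[Dict[str, Any]]:
    """Copy each char and annotate it with its section's tag and MCID."""
    tagged = []
    for char in chars:
        obj = dict(char, tag=section["tag"], mcid=section["mcid"])
        if include_attrs:
            # Every object points at the same dict, so memory use stays
            # small even when a section contains many objects.
            obj["attrs"] = section["attrs"]
        tagged.append(obj)
    return tagged


# Assumed sample data: a page header marked as an artifact.
section = {
    "tag": "Artifact",
    "mcid": 3,
    "attrs": {"Type": "Pagination", "Subtype": "Header"},
}
chars = [{"text": "P"}, {"text": "1"}]

default_output = tag_objects(chars, section)
verbose_output = tag_objects(chars, section, include_attrs=True)
print(default_output[0])
print(verbose_output[0])
```

Sharing one dictionary keeps memory low, but anything that serializes the objects (JSON or CSV export, for instance) would still repeat the attributes once per object, which is where the "large outputs" concern comes in.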