pdf-issues
Tagging (logical structure) is incompatible with shared PDF objects
Extended from discussions in issue https://github.com/pdf-association/pdf-issues/issues/343
I believe there's a fundamental incompatibility between the design of Tagged PDF and the reuse of PDF objects (i.e. streams and annotations). It should be resolved, or at least the restrictions should be explained more explicitly.
Conclusion First:
- 14.7.5.4 Finding structure elements from content items defines methods of lookup to find the parent structure node from the content item
- The lookup relies on StructParent(s) entries in the immediate containing stream (or the object itself for OBJRs), which can only result in a single location in the structure tree for any particular lookup tuple (will be defined later).
- This leads to conflicting goals/semantics when trying to reuse pdf content, as the reused content may have the same lookup tuple, leading to collisions.
- Some of these collisions can be resolved, but the spec does not obviously say how, while others stem from incompatible goals (if you wish to reuse a piece of content together with its internal logical structure, this is either impossible or severely restricted, depending on the case).
Definitions and Generalized Model:
- First, let us generalize across the different content and ways of tagging them, by considering the PDF as one big reflowed stream
- let us use something akin to XML to denote certain scopes/boundaries we're interested in, Objs, marked-content-sequences, etc.
- let us consider Annotations on a page to be equivalent to XObjs (i.e. their AP streams) placed at the beginning of a Page, with certain implicit restrictions
- Illustrative Example (non-exhaustive):
<Page1 obj=1>
  <Annot obj=2> Shared Annot </Annot>
  <Annot obj=3>Some Annot</Annot>
  Page stream content here
  <MC id=1> MC content here </MC>
  <MC id=2>
    <XObj obj=4>
      XObj content here, can have more nested XObjs
      Cannot have more MCIDs because we're already inside MC id=2 for page stream
    </XObj>
  </MC>
  <XObj obj=5>
    Shared Obj
    <MC id=1> this MC belongs to XObj 5 </MC>
  </XObj>
</Page1>
<Page2 obj=10>
  <Annot obj=2> Shared Annot </Annot>
  <XObj obj=5>
    Shared Obj
    <MC id=1> this MC belongs to XObj 5 </MC>
  </XObj>
</Page2>
- Furthermore let us generalize the struct parent lookup by simply assigning a unique id for each unique "lookup tuple", for ease of reference/comparison
- MCID content can be generalized as MCR, the lookup tuple being (stm_obj, mcid) -> stm_obj being the immediate containing stream
- OBJR content have a lookup tuple that's just the (obj) itself
- note that MCRs and OBJRs actually contain other distinguishing fields for the forward lookup (from struct), such as Pg separate from Stm, but we can't make use of them in the reverse lookup (from content)
- Combining with the above XML-like format, a piece of structural content would look something like
<XObj obj=5, lookup=5> ... </XObj>
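The lookup tuples defined above can be sketched in a few lines of Python. This is only a toy model of the generalization in this post, not of any real PDF library: `ContentItem`, `lookup_tuple`, and the sample values are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ContentItem:
    """One tagged piece of content, as seen from the content side (hypothetical model)."""
    kind: str                    # "MC" for marked content, "OBJ" for an XObject/annotation
    obj: int                     # containing stream for "MC", the object itself for "OBJ"
    mcid: Optional[int] = None   # only meaningful when kind == "MC"


def lookup_tuple(item: ContentItem) -> Tuple[int, ...]:
    """Return the key that the reverse (content -> structure) lookup has available."""
    if item.kind == "MC":
        if item.mcid is None:
            raise ValueError("marked content needs an MCID")
        return (item.obj, item.mcid)   # (stm_obj, mcid)
    return (item.obj,)                 # OBJR: just the object


# Items taken from the illustrative example above.
page1_mc1 = ContentItem("MC", obj=1, mcid=1)   # <MC id=1> in the Page1 stream
xobj5_mc1 = ContentItem("MC", obj=5, mcid=1)   # <MC id=1> inside XObj 5
xobj5 = ContentItem("OBJ", obj=5)              # XObj 5 tagged as a whole
```

Note that nothing in either tuple says which page or which invocation the content was rendered from, which is the root of the problem described below.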
Note that 14.7.5.1.1 Content items restricts nesting of structural content, which can be more easily stated in our generalization as:
- no structural content scope can have another structural content scope nested underneath it, whether it be XObj under MC, MC under MC, MC under XObj, etc.
- that is to say, this is illegal:
<MC lookup=5> <MC lookup=6> ... </MC> </MC>
- this is necessary so that we can unambiguously perform the StructParent lookup by going to the most immediate scope that has a lookup (which is also the unique one), and so that tagged content cannot overlap (which would mean you don't know which overlapped parent you should go to).
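As a rough sketch of the nesting restriction in this generalized model, the check below walks a scope tree and reports every tagged scope that sits inside another tagged scope. The `Scope` class and `nesting_violations` function are hypothetical and only mirror the XML-like notation used here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Scope:
    name: str
    lookup: Optional[int] = None               # None means "not structural content"
    children: List["Scope"] = field(default_factory=list)


def nesting_violations(scope: Scope, tagged_ancestor: Optional[str] = None) -> List[Tuple[str, str]]:
    """Return (outer, inner) name pairs where a tagged scope is nested in a tagged scope."""
    found = []
    if scope.lookup is not None and tagged_ancestor is not None:
        found.append((tagged_ancestor, scope.name))
    # The nearest tagged scope becomes the ancestor for everything below it.
    next_ancestor = scope.name if scope.lookup is not None else tagged_ancestor
    for child in scope.children:
        found.extend(nesting_violations(child, next_ancestor))
    return found


# The illegal case from above: MC under MC.
illegal = Scope("Page", children=[
    Scope("MC5", lookup=5, children=[Scope("MC6", lookup=6)]),
])

# Legal: an untagged XObj under a tagged MC, and a tagged MC under an untagged XObj.
legal = Scope("Page", children=[
    Scope("MC5", lookup=5, children=[Scope("XObj4")]),
    Scope("XObj5", children=[Scope("MC1", lookup=7)]),
])
```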
The Problem:
- With this particular transform, it's easy to see that Tagged PDF basically relies on a flattened and unique chunking of content in order to properly lookup the corresponding parent structure node.
- When anything is reused, it is equivalent to copy-pasting the entire "XML" subtree (as seen with the shared <XObj obj=5> in the example above)
- The problem then arises if the tagging scope (i.e. where lookup=id is assigned) is inside or equal to the sharing scope: you no longer have a unique lookup, because 2 separate instances of PDF content refer to the same structure node
e.g.
<XObj obj=5, lookup=3>...</XObj> ... then later ... <XObj obj=5, lookup=3>...</XObj>
- lookup=3 is shared due to the sharing of the object; this manifests as the StructParent(s) entry being the same entry on the same object between the 2 separate content renderings.
- that example was for an OBJR, but you can also extrapolate an analogous duplicated MC within the shared obj
- This may be valid under certain very restricted use-cases, but that mostly comes down to coincidence. Under general usage, the 2 separate renderings should be semantically distinct, and that distinction should be reflected in the structure tree lookup
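To make the collision concrete, here is a small sketch that lists every tagged rendering from the illustrative example as a (page, lookup tuple) pair and then groups by lookup tuple. It assumes both annotations are tagged as OBJRs and that the MC inside XObj 5 is tagged; the `occurrences` list and `collisions` function are hypothetical.

```python
from collections import defaultdict

# (where it is rendered, lookup tuple) for each tagged rendering in the example.
occurrences = [
    ("Page1", (2,)),     # shared Annot obj=2, as an OBJR
    ("Page1", (3,)),     # Annot obj=3, as an OBJR
    ("Page1", (1, 1)),   # <MC id=1> in the Page1 stream
    ("Page1", (1, 2)),   # <MC id=2> in the Page1 stream
    ("Page1", (5, 1)),   # <MC id=1> inside shared XObj 5
    ("Page2", (2,)),     # the same shared Annot again
    ("Page2", (5, 1)),   # the same MC inside shared XObj 5 again
]


def collisions(items):
    """Map each lookup tuple that has more than one rendering to its renderings."""
    by_key = defaultdict(list)
    for where, key in items:
        by_key[key].append(where)
    return {key: places for key, places in by_key.items() if len(places) > 1}


for key, places in collisions(occurrences).items():
    print(key, "is rendered at", places)
```

Every key reported here can resolve to only one structure node in the reverse lookup, even though it stands for two separate renderings.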
--
- Do note that I am not saying that the structure tree lacks the capability to distinguish the 2 during the FORWARD (from struct) lookup: you CAN have multiple MCR and OBJR instances in different places in the tree, and they will properly find where the content lives.
- caveat: there are some edge cases where they live in the same stream/page and you're stuck. Again, it basically works or not based on certain coincidences, because I think the whole scheme is not very well defined in the case of sharing.
- It's important to keep the forward and backward lookups distinct. Even more important is to look specifically at the procedure, and not at the nebulously spec-defined identities of "content item" and "structural content", which conflate the stream data with the struct tree linkage data (e.g. the OBJR dict), because that identity/uniqueness breaks down in this case.
Provide a recommendation for correction:
- I'm not sure there's a good way (without a major overhaul) to achieve exactly what we want in terms of unambiguously referring to both instances while also maintaining object sharing as it was traditionally done before tagging.
- I believe in order to make that happen, we'll need to add more required steps/data to the lookup algorithm, to expand the lookup tuple such that it uniquely identifies the appearance location.
- suppose we succeed in doing that, do we also want to be able to share subtrees from the struct tree? (e.g. if you share a rendered form, why wouldn't you have the same subtree detailing its logical structure?)
- note that the structure tree further has an implicit no sharing rule currently, because we need a parent pointer
- More realistically, I think the best (and least) we can do is detail these restrictions clearly and concisely in a separate section, and give people recommendations on what to do when they're trying to tag something that already shares objects.
- yes the spec only defines "valid" configurations, and not necessarily modifications, but absence of reminder/instruction could lead people to generate corrupt documents without realizing it.
- my current solution (let me know if anyone has better ideas) is to simply clone everything that's shared, so that it's no longer shared
- yes, recursively; yes, it blows up the size if the nesting is deep
- this is the only GENERAL solution I can think of, but if the shared section can be tagged together, then it can continue to be shared.
Concisely stated, the restriction is: the tagging scope must be strictly above the sharing scope
- that is to say, the following is OK, because lookup=5 and lookup=6 are distinct, and only the CONTENT is copied:
<MC lookup=5>
  <XObj obj=5> Shared </XObj>
</MC>
<MC lookup=6>
  <XObj obj=5> Shared </XObj>
</MC>
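The restriction "tagging scope must be strictly above sharing scope" can be checked mechanically in this model. The sketch below is a minimal illustration under the assumptions of this post: a scope counts as shared if its object number appears more than once in the flattened tree, and any tagged scope at or below a shared scope is a violation. `Node`, `walk`, `shared_objects`, and `violations` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    obj: Optional[int] = None      # object number if this scope is an XObject/annotation
    lookup: Optional[int] = None   # lookup id if this scope is tagged
    children: List["Node"] = field(default_factory=list)


def walk(node):
    """Yield every scope in the flattened document tree."""
    yield node
    for child in node.children:
        yield from walk(child)


def shared_objects(root):
    """Object numbers that are rendered more than once."""
    counts = Counter(n.obj for n in walk(root) if n.obj is not None)
    return {obj for obj, times in counts.items() if times > 1}


def violations(root):
    """Lookup ids assigned at or below a shared scope (one entry per rendering)."""
    shared = shared_objects(root)
    bad = []

    def visit(node, inside_shared):
        inside = inside_shared or node.obj in shared
        if inside and node.lookup is not None:
            bad.append(node.lookup)
        for child in node.children:
            visit(child, inside)

    visit(root, False)
    return bad


# The OK layout above: distinct MC scopes wrap the shared object.
ok_doc = Node(children=[
    Node(lookup=5, children=[Node(obj=5)]),
    Node(lookup=6, children=[Node(obj=5)]),
])

# The problematic layout: the lookup lives on the shared object itself.
bad_doc = Node(children=[Node(obj=5, lookup=3), Node(obj=5, lookup=3)])
```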
I've glossed over (or even possibly missed) some stuff like optional content, and branching appearance streams, but this post is long enough as is, and we'll probably need to solve those problems separately depending on the solution(s) we pick.
I think what you're trying to achieve here is a means of a single object being referenced at multiple places in the structure tree, correct? The term "reuse" is a bit overloaded unfortunately, but it doesn't seem like you're referring to the work being done in the reuse-twg.
So perhaps you have a brand logo that is repeated on every page - an XObject, with an image and some text that is marked up internally with tags, which I'm going to call a "document fragment". And the goal is to be able to repeat that "document fragment" more than once on the structure tree, every time the XObject is referenced.
Is that the kind of use case you're thinking of? Because it's the only case I think the current design really can't handle - at the moment that brand logo would have to be presented in the structure tree as a "blob", a black box with no content, using a "StructParent" rather than "StructParents".
I agree it might be nice if it were available, but I don't see a pressing need to fix this:
- Do you have to clone something? Not really. For each copy of the brand logo, create a unique XObject with a StructParent (which exists once in the tree), and draw the actual brand logo XObject inside it.
- Does the XObject have any internal tag structure? Effectively no, but then I don't see a situation where it's required. I think of it like the shadow-DOM for an element in HTML; it's nice to have, but I've yet to find a need for it.
Does it need a better explanation in the spec? I'm never going to disagree with that suggestion. But as I think I mentioned in the original issue, once you try to implement it, it turns out it can only be done one way. I'd agree that's a bit later in the process than it should be, and I'd support some sort of worked example as an explanatory note in the spec if that's what you had in mind?
> Is that the kind of use case you're thinking of? Because it's the only case I think the current design really can't handle
That's what I meant when I said (as a potential expansion) the following:
> suppose we succeed in doing that, do we also want to be able to share subtrees from the struct tree?
But the rest of the problems apply even if you have separate independent subsections of the structure tree, as long as the underlying content is shared. You can do the forward lookup if the tree does not have duplicate sections (it already can't currently, due to parent pointers), but the backward lookup is ill defined.
> but I don't see a pressing need to fix this
From my experience, object reuse is fairly common, and in fact, if you didn't need to reuse, you could do away with the XObject entirely and just put it into the page stream.
But it is a valid position, if you want to take it, that people with such object reuse need to refactor those documents so that the shared portions they want to tag are cloned and tagged separately, as I mentioned in "my current solution".
If that is our position, we should make it explicit, because it's not obvious imo. (In fact, this whole conversation started because the CURRENT spec mentions OBJRs and "rendered on multiple pages")
> it's nice to have, but I've yet to find a need for it
I do agree that there's always a way around it if you know what you're doing, and you can avoid inflicting this problem on yourself if you just create docs in the "correct" way. But I think you run into this problem more often when you're trying to add tags to existing documents that have weird XObject reuse, and people can easily inflict problems on themselves if the spec does not help them steer clear.
> For each copy of the brand logo, create a unique XObject with a StructParent (which exists once in the tree), and draw the actual brand logo XObject inside it.
This falls under the case for:
> but if the shared section can be tagged together, then it can continue to be shared
You can actually do better (i.e. you don't need to create a "wrapper XObject", unless you really want to use an OBJR for some reason): just tag around the separate Do operators invoking the original shared object, if you're able to reuse the ENTIRE shared object (as a black box, as you say).
The cloning is for when you have structure within that object; in that case it must be duplicated.
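For reference, the recursive cloning can be sketched roughly as follows. This is a toy model, not code against any real PDF library: pages and XObjects are plain dictionaries of object numbers, and `unshare` is a hypothetical helper. It assumes fresh object numbers can be allocated above the largest XObject number in use.

```python
from itertools import count


def unshare(pages, xobjects):
    """Give every rendering its own object.

    pages:    {page name: [XObject number, ...]}
    xobjects: {XObject number: [nested XObject number, ...]}
    Returns new (pages, xobjects) in which no object is rendered twice.
    """
    fresh = count(max(xobjects, default=0) + 1)   # assumed-free object numbers
    new_xobjects = {}
    seen = set()

    def instantiate(obj):
        # The first rendering keeps the original number; later ones get a clone.
        new_id = obj if obj not in seen else next(fresh)
        seen.add(obj)
        # Recurse so that nested shared objects are cloned as well.
        new_xobjects[new_id] = [instantiate(child) for child in xobjects[obj]]
        return new_id

    new_pages = {page: [instantiate(obj) for obj in refs] for page, refs in pages.items()}
    return new_pages, new_xobjects


# XObj 5 is drawn on both pages and itself draws XObj 6.
pages = {"Page1": [4, 5], "Page2": [5]}
xobjects = {4: [], 5: [6], 6: []}
new_pages, new_xobjects = unshare(pages, xobjects)
```

In this example the second rendering of XObj 5 becomes a new object 7, and its nested XObj 6 becomes a new object 8, which is where the size blow-up with deep nesting comes from.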
Separately, it is not obvious that this correct way of doing it is required, and it differs from what the spec currently says about OBJRs (i.e. people would default to following the spec's recommendation and use an OBJR instead of an MCID/MCR around the Do, and the tagged reuse becomes ill defined).
> once you try to implement it, it turns out it can only be done one way.
I don't believe this is true at all. If you look at the other issue, there was confusion for over a year with no solution. This current thread is my summary after a long time dealing with it. I've seen people (including myself) try to deal with it with a partial understanding and end up with documents with broken lookups. (It's not obvious that it's broken, as it still works in both directions. It's just not a bijection.)
> NOTE 2 If the referenced object is rendered on multiple pages, each rendering requires a separate object reference. However, if it is rendered multiple times on the same page, just a single object reference suffices to identify all of them.
To clarify, this note under 14.7.5.3 PDF objects as content items is what I'm referring to. Deleting/changing it is a necessary, but imo not sufficient, step toward consistency and clarity in the spec.