pdf-issues 14.7.5 (Logical structure, PDF objects as content items): parent structure elements and object references

Here I am reporting a possible bug to correct (SUB-ISSUE 1), and requesting a clarification on an apparent contradiction related to it (SUB-ISSUE 2).

SUB-ISSUE 1: Cardinality of parent structure elements of PDF objects as content items

Describing the structural parent tree from the point of view of a PDF object as content item (such as an image XObject), subclause 14.7.5.4 (Finding structure elements from content items) states:

The tree shall contain an entry for each object that is a content item of at least one structure element [...].

To my understanding, the phrase "at least one" is misleading, since a PDF object that is a content item is required (see also Table 359 (Additional dictionary entries for structure element access)) to be associated with ONE AND ONLY ONE parent structure element, as later stated:

The key for each entry shall be an integer given as the value of the StructParent [...] entry in the object [...]. [...] the value [for each entry] shall be an indirect reference to the parent structure element.

This statement is unequivocal, especially considering that, on the other hand, for a content stream (page object or form XObject) containing marked-content sequences that are content items "the value shall be an array of indirect references to the sequences' parent structure elements."
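The two shapes of parent tree entry quoted above can be modelled in a few lines. This is only an illustrative Python sketch, not spec text: the dictionary stands in for the number tree, the strings stand in for indirect references to structure elements, and every name (`parent_tree`, `parent_of_object`, `parent_of_marked_content`) is hypothetical.

```python
from typing import Dict, List, Union

# Hypothetical stand-in for the structural parent tree (a number tree).
# Strings such as "elem_figure" stand in for indirect references.
ParentTreeValue = Union[str, List[str]]

parent_tree: Dict[int, ParentTreeValue] = {
    0: ["elem_p1", "elem_p2"],  # key taken from a page's StructParents: one parent per MCID
    1: "elem_figure",           # key taken from an image XObject's StructParent: one parent only
}


def parent_of_object(struct_parent: int) -> str:
    """Return the single parent of a PDF object that is a content item."""
    value = parent_tree[struct_parent]
    if isinstance(value, list):
        raise TypeError("expected a single reference, found an array")
    return value


def parent_of_marked_content(struct_parents: int, mcid: int) -> str:
    """Return the parent of a marked-content sequence, indexed by its MCID."""
    value = parent_tree[struct_parents]
    if not isinstance(value, list):
        raise TypeError("expected an array, found a single reference")
    return value[mcid]
```

In this model there is simply no slot in which a second parent for the image XObject could be stored, which is why "at least one" reads as misleading.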

SUB-ISSUE 2: Parent structure element of a PDF object as content item in case of multiple separate object references

Given all of the above, the following statement in NOTE 2 of subclause 14.7.5.3 (PDF objects as content items) depicts a more complex case which, to my understanding, leads to a contradiction:

If the referenced object is rendered on multiple pages, each rendering requires a separate object reference.

As you know, a PDF object as content item is included in the structure tree (see 14.7.2 (Structure hierarchy)) as an object reference (see 14.7.5.3 (PDF objects as content items)). Since a PDF object as content item can be associated, via the structural parent tree, with ONE AND ONLY ONE parent structure element, NOTE 2 implies that its single parent structure element will have to contain ALL its separate object references, each one for the distinct page where it is rendered (am I right?). However, such a conclusion seems to disrupt the whole purpose of logical content order (the ordering for semantic purposes) as described by subclause 14.8.2.5 (Page content order and logical content order)... What's wrong with it?

What is the correct way to place in the logical structure multiple separate object references (one per page) of the same PDF object as content item? (EXAMPLE 1 in 14.7.5.4 (Finding structure elements from content items) only shows the simple case of a single object reference; IMO, that example should be expanded to address also the case depicted in NOTE 2, ie multiple separate object references, one per page — BTW, apparently that example is the only one in the whole ISO PDF spec to mention object references, so it is crucial to integrate it)
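To make the problematic case concrete, here is a small Python sketch that scans a flattened list of object references and reports any PDF object referenced from more than one structure element. The tuple layout and all identifiers (`objrs`, `objects_with_several_parents`, `img7`, and so on) are assumptions made for illustration; they do not come from the spec.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical flattened view of the structure tree:
# (structure element, page of the OBJR, referenced object).
objrs: List[Tuple[str, str, str]] = [
    ("elem_fig_a", "page1", "img7"),
    ("elem_fig_b", "page2", "img7"),  # same object, different parent: the NOTE 2 situation
    ("elem_link", "page1", "annot3"),
]


def objects_with_several_parents(refs: List[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Map each object to its parents, keeping only objects with more than one."""
    parents: Dict[str, Set[str]] = defaultdict(set)
    for element, _page, obj in refs:
        parents[obj].add(element)
    return {obj: elems for obj, elems in parents.items() if len(elems) > 1}
```

Any object reported by this check cannot be described by a single StructParent entry, which is exactly the tension between NOTE 2 and 14.7.5.4.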

Sep 10 '23 14:09 stechio

@DuffJohnson? @mrbhardy?

Sep 11 '23 01:09 petervwyatt

I ran into this problem when reading the spec as well. my 2 cents is that, I see a few different consistent resolutions:

1. disallowing multiple OBJRs from referencing the same XObj
   - we basically have to replace & negate the note:

     "If the referenced object is rendered on multiple pages, each rendering requires a separate object reference."
   - if you want multiple references, you'll have to convert your OBJRs to MCRs and tag the "Do" operators at each instance
2. allowing and requiring the removal of StructParent in the Obj itself in this case (potentially a bit hacky)
   - the situation must be that the XObj is tagged via the Do operator in each including stream, and has an MCID in each
   - forward lookup from structure tree will go directly to the Obj via the OBJR in the leaf
     - works the same as single OBJR
   - backward lookup from the content will find that StructParent does not exist, and use instead the StructParents entry of the "owning stream" (this assumes the pdf processor somehow has a context stack to look up)
     - this proceeds as normal using the MCID given by the tagged 'Do' operator
3. allowing and requiring StructParent (singular) to be an Array in this case
   - this will provide enough information to solve the multiple reference problem, but will require additional logic to distinguish the instances
     - i.e. will have to look at all the potential linked OBJRs and match to the current context (i.e. via Pg?)
   - but may introduce unintended additional complexity, as all lookups will potentially have to take it into account
     - maybe fine if it's restricted to the XObjects special case? (i.e. not allowed for Pages)
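A rough Python sketch of the backward lookup in the second resolution (dropping StructParent and falling back to the owning stream) might look like this. It is only a model of the proposal under discussion, not of current spec behaviour; `backward_lookup`, the dictionary layout, and the element names are all hypothetical.

```python
from typing import Dict, List, Optional, Union

# Hypothetical parent tree: arrays for StructParents keys, single refs for StructParent keys.
parent_tree: Dict[int, Union[str, List[str]]] = {
    0: ["elem_a"],   # page 1, MCID 0
    1: ["elem_b"],   # page 2, MCID 0
    2: "elem_logo",  # an XObject that is used exactly once
}


def backward_lookup(xobject: dict, owning_stream: dict, mcid: Optional[int]) -> Optional[str]:
    """Find the parent structure element for one invocation of an XObject."""
    if "StructParent" in xobject:
        # Ordinary case: the object itself names its single parent.
        return parent_tree[xobject["StructParent"]]
    if mcid is None:
        # Neither an OBJR-style entry nor an enclosing tagged sequence: untagged.
        return None
    # Proposed fallback: use the owning stream's StructParents plus the MCID
    # of the marked-content sequence that encloses the Do operator.
    return parent_tree[owning_stream["StructParents"]][mcid]


shared_xobject: dict = {}  # reused on two pages, so it carries no StructParent
page1 = {"StructParents": 0}
page2 = {"StructParents": 1}
```

The same shared object resolves to a different parent depending on the invoking stream, which is the behaviour a single StructParent entry cannot express.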

Notes:

MCRs won't work for Annotations, since those are not referenced in the streams, but it's unclear reusing annotations is a valid use case, much less tagging them if reused.
Separately (unrelated to OBJR): We may want to add additional clarification/constraints on the requirement of fields such as Pg, Stm, StmOwn in the MCR to support unambiguous lookup in cases of multiple reference

Sep 29 '23 21:09 myang-apryse

Update: this post is out of date, the spec does contain a subsequent disambiguating statement

and for each content stream containing at least one marked-content sequence that is a content item.

@stechio ~~My interpretation of sub-issue-1 differs from yours wrt to intent, although maybe the spec should be written in a way to avoid this confusion.~~

~~My interpretation is that the spec intends: The [Number]Tree shall contain an entry [that is of type Ref or Array], for each [PDF] object that is a [or contains] content item[s] of at least one structure element.~~

~~which leads to 2 situations (which may bear enumerating instead of aggregating into the single sentence).~~

~~is a content item -> StructParent (singular) entry ->ref in ParentTree~~
- ~~this leads to sub-issue-2 in the case of multiple OBJR under different parents for the same XObj~~
- ~~points directly to a SINGLE parent struct element, the cardinality of this is EXACTLY ONE, as you say~~
~~is a container for content items -> StructParents (plural) -> array of refs in ParentTree~~
- ~~each entry in the array is a reference which may point at different structure elements, we don't want empty array, so the cardinality is ONE OR MORE.~~

~~The union of these 2 cases gives the requirement of at least one~~

~~I believe the bolded section is what caused this confusion, "entry" is ambiguous/confusing whether it refers to:~~

~~a NumberTree entry~~
- ~~in this case we're missing wording that allows for "containment" of content~~
~~an end element reference (I think this may be your interpretation?):~~
- ~~whether it be a NumberTree entry itself,~~
- ~~or an element within the Array that is the NumberTree entry~~

Sep 29 '23 21:09 myang-apryse

@myang-apryse:

@stechio My interpretation of sub-issue-1 differs from yours wrt to intent, although maybe the spec should be written in a way to avoid this confusion.

My interpretation is that the spec intends: The [Number]Tree shall contain an entry [that is of type Ref or Array], for each [PDF] object that is a [or contains] content item[s] of at least one structure element.

I am not persuaded: your interpretation inappropriately adds a containment facet to PDF objects, whereas the spec makes a clear distinction between, on one hand, PDF object as a content item and, on the other, content stream containing marked-content sequences. The case of a PDF object (namely, form XObject) containing marked-content sequences as content items is already specified by the latter, no conceptual overlapping is possible here.

Oct 04 '23 14:10 stechio

After a second reading, I agree with your interpretation.

Oct 04 '23 17:10 myang-apryse

@myang-apryse:

After a second reading, I agree with your interpretation.

However, I appreciated your resolution proposals for SUB-ISSUE 2 (IMHO, they are worth a more in-depth evaluation):

I ran into this problem when reading the spec as well. my 2 cents is that, I see a few different consistent resolutions:

disallowing multiple OBJRs from referencing the same XObj [...]

allowing and requiring the removal of StructParent in the Obj itself in this case (potentially a bit hacky) [...]

allowing and requiring StructParent (singular) to be an Array in this case [...]

Oct 04 '23 17:10 stechio

@mrbhardy - could you please contribute...

Nov 14 '23 02:11 petervwyatt

Here are some relatively simple ideas for clarifying the relevant current provisions. I hope they are useful.

Re: Some Normative stuff that is incorrect/insufficient.

The start of 14.7.5.3 needs clarification, it should perhaps say something like this: “When a structure element’s content contains an entire PDF object (that is not also contained in the content of any other structure element), such as . . . “

But this restriction on the use of OBJRs could be in a separate sentence.

(The wording here is probably the worst problem!) In 14.7.5.4 there is a misleading phrase, “at least one” (since this can never be “more than one”); it could be replaced by “a”, without changing the meaning at all. Suggested change: “The tree shall contain an entry for each object that is a content item of at least one structure element” ==> "The tree shall contain an entry for each dictionary object that is a content item of a structure element". Note that it is very important to alter the phrasing here; this change will not affect the normative meaning.

In general, it is important to clarify this restriction, with a clear statement such as: “A PDF object can never appear (as a content item referenced by an OBJR) as an item in more than one structure element.” This explanation/requirement could be in a Note.

Re: Some related, non-normative, texts also need attention. This is because some sentences either do not make sense, are not sufficiently precise, or they give a wrong impression. These following two, in particular, are also from 14.7.5.3, so they are closely related to 1. above.

A. In Note 1, above Table 356, the second sentence needs a lot of clarification. I am not sure what might have been its intended meaning, as it appears to be about the content stream of some object, without saying what object this is.
It would probably be more useful if it said something about the content items of a structure element. Also, the first sentence therein may be clearer as follows: “This form of reference can only be used for entire objects that are in the content of a structure element.”

B. Note 2 may be the main cause of the current confusion? It definitely needs some additions, in order to explain that: “An object that is referenced by an OBJR in the content of a structure element can be referenced more than once (possibly using a distinct OBJR), but only when any further references to this object are within the content of that same structure element.” And/or: “Each object that is so referenced can appear only in the content of this one structure element, and in no others.”

Jan 09 '24 06:01 car222222

@car222222:

Here are some relatively simple ideas for clarifying the relevant current provisions. I hope they are useful.

Re: Some Normative stuff that is incorrect/insufficient.

Could you please provide us some context (your text looks a bit sketchy, as if it was extracted from a mailing-list thread I couldn't trace back)?

Jan 24 '24 16:01 stechio

@stechio

This was intended to help @mrbhardy to respond in a reasonably minimal way to the above (non-specific) question from @petervwyatt.

It was not derived from any other correspondence.

Naturally, it assumes familiarity with the current documentation of OBJRs and related ideas. It suggests only how to extend the current documentation so that it accurately and clearly describes the current (normative) situation.

It is not intended to deal with, or even discuss, your concerns about these currently mandated provisions.

Feb 01 '24 11:02 car222222

SFAIK, the sub-issue #1 should be clarified to indicate only a single parent (i.e., I agree that an MCID should only have a single parent). In the case of sub-issue #2, there are two mechanisms for tagging an XObject container: (1) the contents of the XObject container are themselves tagged, but then the Form XObject cannot have a structural element parent (but can be referenced on the page without an associated MCID) or (2) the XObject container has no internal tagging but may be contained in the structure tree exactly once (i.e., a single parent). That said, in case (2) above, I believe that an inner Form XObject could potentially be contained by multiple outer Form XObjects, each specifying a unique parent in the structure tree, and thereby still allowing for the concept of reuse of content, but not violating the concept of a single parent.

May 07 '24 06:05 sgaither

@sgaither I think I agree with what you're saying wrt XObj2 contained in tagged XObj1, as long as XObj2 is not itself tagged through methods 1 & 2, XObj2 can be referred to multiple times. (note that the condition is explicitly required by the spec for XObj1 to be tagged)

And additionally, there's what can be considered a 3rd way to tagging (i.e. surround with structure scope) content inside an XObj, and that is to surround the "do" operator in the parent stream with an MC. This way, you can use the StructParent(s) of the parent stream, allowing for proper lookup of multiple parent structure elements from different usages of the same XObj.

In some sense, why this works is the same reason as what you said, if you treat case (2) as surrounding the entire XObj1 stream (which contains the "do" for XObj2) with an MC.

May 07 '24 17:05 myang-apryse

The PDF/UA TWG agrees with sub-issue 1 and that it should be:

The tree shall contain an entry for each object that is a content item of a structure element

May 10 '24 00:05 mrbhardy

NOTE: @sgaither to provide some examples

May 10 '24 00:05 mrbhardy

Waiting for examples from @sgaither before finalizing and closing this issue.

May 10 '24 02:05 petervwyatt

(I apologize for responding via an incorrect login in a prior response) Upon more careful examination of the specification, the verbiage at the end of 14.7.5.2 is quite clear that: "A form XObject that is painted with multiple invocations of the Do operator shall be incorporated into the Document's logical structure only by the first method, with each invocation of Do individually associated with a structure element"

First, I propose to append to this final sentence above: "See Example 6 in this subclause for multiple invocation of a form XObject to overprint a phrase in conjunction with the use of ActualText to indicate that the phrase is to be read only once (see 14.9.4, "Replacement text"). Note that since each invocation of the Do operator is required to be individually associated with a structural element, a single Span directly contained inside the page content stream cannot be used to contain both invocations of the form XObject."

Second, I propose to add the following text contained within the attached file as "EXAMPLE 6": 32000-2-2020-14.7.5.2-Example-6.txt

May 20 '24 17:05 sgaither

PDF TWG agree but want a few days to review example in detail.

May 23 '24 21:05 petervwyatt

I'm coming late to this discussion, so please bear with me if I'm asking silly questions.

First off, in the example, shouldn't the /P tag of object 2 be "1 0 R" rather than an ellipsis? The parent element is included within this example, so it should be shown unambiguously.

Secondly, I'm unsure about how "A form XObject that is painted with multiple invocations of the Do operator shall be incorporated into the Document's logical structure only by the first method, with each invocation of Do individually associated with a structure element" is supposed to work at all.

My understanding is that the semantic content of the example given should only contain "clear emphasis" once. It's invoked and hence drawn twice, but a text extraction (or a screen read etc) of the document should only actually contain that once.

I'm trying to understand how I'm supposed to know that /Fm5 should only be counted semantically once.

Is the interpreter supposed to keep a list of all the form XObjects it's encountered so far whilst processing a page stream, and to check every new invocation against that list? What if the same XObject is used from multiple pages? Or on the same page from different annotations?

I could kind of understand how to do this unambiguously if the XObject contained some piece of information that said "I only count semantically when I'm invoked as MCID x", but I can't see anything like that.

I'd claim that this kind of example (same content, multiple uses, only counting once) is MUCH less common than the usual way I'd expect to see this kind of content used (same content, multiple uses, counting every time). For instance, imagine that I've got a document advertising the benefits of new WhizzyShine. Every time I mention WhizzyShine in the text, I want to use the funky WhizzyShine logo, so I encapsulate that into an XObject and reuse that multiple times across multiple pages. During a semantic extraction from the document, I want that to appear every single time.

But this seems to be exactly the thing that we're prohibiting? There is probably something obvious I'm missing here...

If so, an example that differentiated the 2 different cases would probably help both myself and other bears of tiny brain here.

May 27 '24 11:05 robinwatts

@robinwatts (I apologize if my comments repeat some stuff, I might have misunderstood your level of understanding)

Wrt your comments on MCID and invoking, I hope the following helps, as I think I was confused in a similar way when I first read it, but the truth becomes clear as you work with PDF objs.

"Method 1" means that you surround some /FmX Do with a marked content sequence that has an MCID.

The concept of "XObj invocation" is the usage of the "Do" operator, and it acts as if you replace the instance of the "Do" instruction with the contents of the XObj stream, with some caveats on MCID as I'll discuss later:

MCID is a number that is stream local, that is to say, both the page stream and Fm5 xobj stream can contain say MCID 1. (So your comments about being used on multiple pages are resolved, as they are independent, with each instance having a unique pair of <containing stream, MCID>)

So taken all together:

/Span <</MCID 1>> BDC /Fm5 Do EMC /Span <</MCID 2>> BDC /Fm5 Do EMC

is, for the purposes of rendering, equal to

/Span <</MCID 1>> BDC ... (clear emphasis) Tj ... EMC /Span <</MCID 2>> BDC ... (clear emphasis) Tj ... EMC

With the caveat that you ignore stuff about MCIDs in the substitution, since you have to look those up relative to the immediate containing stream. "Method 1" has happened with the fact "/Fm5 Do" is in between a BDC (with MCID) and an EMC, MCID 1 refers to the first use, and MCID 2 refers to the 2nd use.
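As a rough illustration of the <containing stream, MCID> pairing, the following Python sketch walks a pre-tokenized content stream and records which MCID encloses each Do operator. The token list format (names as strings, property lists as inline dicts) and the function name `tagged_invocations` are assumptions for this sketch; a real content stream parser is considerably more involved.

```python
from typing import Any, List, Optional, Tuple


def tagged_invocations(stream: str, tokens: List[Any]) -> List[Tuple[str, Optional[int], str]]:
    """Return (containing stream, enclosing MCID, XObject name) for every Do operator."""
    found = []
    stack: List[Optional[int]] = []  # MCID (or None) of each open marked-content sequence
    prev: Any = None
    for tok in tokens:
        if tok == "BDC":
            # The operand just before BDC is the property list (assumed inline here).
            stack.append(prev.get("MCID") if isinstance(prev, dict) else None)
        elif tok == "BMC":
            stack.append(None)
        elif tok == "EMC":
            stack.pop()
        elif tok == "Do":
            # Innermost enclosing sequence that carries an MCID, if any.
            mcid = next((m for m in reversed(stack) if m is not None), None)
            found.append((stream, mcid, prev))
        prev = tok
    return found


page1_tokens = [
    "/Span", {"MCID": 1}, "BDC", "/Fm5", "Do", "EMC",
    "/Span", {"MCID": 2}, "BDC", "/Fm5", "Do", "EMC",
]
```

Both invocations name the same /Fm5, yet each carries its own (stream, MCID) pair, so they stay distinguishable without any entry inside the form XObject itself.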

but a text extraction (or a screen read etc) of the document should only actually contain that once.

This is achieved in the example by the fact that the 2 instances are both under the same structure element node, with the ActualText being the replacement for both instances at the same time.

My understanding is that when you do the text extraction, you actually reconstruct the extraction through a traversal from the structure tree side, looking up pieces of the stream content as required (i.e. you skip stuff that's not tagged, and the stuff that is tagged is not necessarily in the same order or same # of instances as in the streams). And when there's ActualText, you just don't bother with the stream content. (so maybe my comment about "for both" is misleading)
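A toy Python version of that structure-driven traversal, under the assumption that leaves are (stream, MCID) pairs and that the text for each pair has already been recovered from the content stream, could look as follows. `extract_text` and the dictionary layout are hypothetical, and this is a sketch of one reading of the spec rather than a description of how any particular processor extracts text.

```python
from typing import Dict, Tuple

Leaf = Tuple[str, int]  # (containing stream, MCID)


def extract_text(element: dict, content: Dict[Leaf, str]) -> str:
    """Walk the structure tree; ActualText replaces everything beneath an element."""
    if "ActualText" in element:
        return element["ActualText"]
    parts = []
    for kid in element.get("K", []):
        if isinstance(kid, dict):
            parts.append(extract_text(kid, content))  # nested structure element
        else:
            parts.append(content[kid])                # marked-content leaf
    return "".join(parts)


content = {("page1", 1): "clear emphasis", ("page1", 2): "clear emphasis"}
with_actual_text = {"S": "Span", "ActualText": "clear emphasis",
                    "K": [("page1", 1), ("page1", 2)]}
without_actual_text = {"S": "Span", "K": [("page1", 1), ("page1", 2)]}
```

With ActualText the phrase is produced once; without it, this traversal would emit the text of both leaves, so in this model the "only once" behaviour depends on the replacement text and not on the shared XObject.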

usual way I'd expect to see this kind of content used (same content, multiple uses, counting every time)

Wrt achieving this, all you have to do is put them under separate structure nodes, as the marked content instances are separate, unlike if you use method 2 (OBJR), hence why it's prohibited.

May 29 '24 04:05 myang-apryse

Is the interpreter supposed to keep a list of all the form XObjects it's encountered so far whilst processing a page stream, and to check every new invocation against that list?

I would recommend you read the relevant sections on Streams, XObjs, etc.

The model of PDF is that a stream contains several parts:

The "stream content", and you can consider that to just be a text block that says stuff like /Span <</MCID 1>> BDC /Fm5 Do EMC.
- A page is like a "pseudo stream" wrt its "Contents" being able to be comprised of multiple objects
Separately, it has a "Resource" Key in its dictionary, which is itself a dictionary, and in its own "XObject" key, maps names to references of XObjects that the parent stream wants to use, and the stream will refer to these XObjects by name in the stream content (i.e. /Fm5 Do, "Fm5" text is the name)
- <</Fm5 5 0 R>> => "Fm5" references object 5

So a spec conforming document's pages & streams will list (i.e. have pointers to) other streams which are the XObjs it invokes. And when processing stream content, you just look up in the Resources by name for anything you encounter.
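In Python terms, the name lookup is roughly the following. The nested dictionaries are a simplified stand-in for the page (or form XObject) dictionary and for the file's indirect objects; `resolve_xobject` and the object-number table are assumptions made for the sketch.

```python
from typing import Dict

# Simplified stand-in for indirect objects, keyed by object number.
objects: Dict[int, dict] = {5: {"Type": "XObject", "Subtype": "Form"}}

# Simplified page dictionary: /Resources << /XObject << /Fm5 5 0 R >> >>
page = {"Resources": {"XObject": {"Fm5": 5}}}


def resolve_xobject(stream_dict: dict, name: str, objs: Dict[int, dict]) -> dict:
    """Resolve the operand of a Do operator (e.g. "/Fm5") through the Resources."""
    ref = stream_dict["Resources"]["XObject"][name.lstrip("/")]
    return objs[ref]
```

No global list of previously encountered XObjects is needed: each Do operator is resolved on the spot through the resources of the stream that contains it.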

May 29 '24 04:05 myang-apryse

If so, an example that differentiated the 2 different cases

The thing we're forbidding in the case of object reuse is "method 2" (OBJR), which is that you can forgo explicit marked content sequences, and just say the entire XObj (e.g. Fm5) is tagged.

So in the example, you would take out the BDCs & EMCs surrounding Fm5 Do, and add to 5 0 obj a "StructParent" (singular) key (see Parent Tree in spec),

5 0 obj                                         %form XObject to be invoked twice
    <</Type /XObject
      /Subtype /Form
      /Length …
      /StructParent ...     % <<< the new entry we're adding
    >>
stream
...

along with corresponding other changes in the structure tree to use OBJR referring to obj 5 instead of MCIDs/MCRs.

Why it doesn't work is because if you follow the process, you can only end up at 1 destination structure element node for the entire Fm5 XObj, which means you can't differentiate between the 2 different instances of /Fm5 Do, but it's fine if you only have 1 instance (across all pages, i.e. no reuse)

Note that the "process" I'm referring to here is 14.7.5.4 Finding structure elements from content items, which is the reverse of the process used in text extraction, and uses different properties/objects. (in a non-conforming file, the 2 lookups can fail to coincide, i.e. struct is not bijective with content)
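The mismatch between the two lookups can be checked mechanically. In this Python sketch, each OBJR leaf found while walking the structure tree is compared with the parent obtained by following the object's StructParent key into the parent tree. All names and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def inconsistent_objrs(
    objr_leaves: List[Tuple[str, str]],  # (structure element, referenced object)
    struct_parent_of: Dict[str, int],    # object -> its StructParent key
    parent_tree: Dict[int, str],         # key -> single parent structure element
) -> List[Tuple[str, str]]:
    """Return the OBJR leaves whose element differs from the backward lookup result."""
    return [
        (element, obj)
        for element, obj in objr_leaves
        if parent_tree[struct_parent_of[obj]] != element
    ]


# Fm5 is referenced by OBJRs under two different elements, but it can only
# carry one StructParent key, so one of the two leaves cannot round-trip.
leaves = [("elem_a", "fm5"), ("elem_b", "fm5")]
struct_parent_of = {"fm5": 3}
parent_tree = {3: "elem_a"}
```

Whichever element the single parent tree entry names, the other OBJR leaf fails the round trip, which is the non-bijective situation described above.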

May 29 '24 04:05 myang-apryse

Actually, after going through the exercise of the example, I think I see what Note2 was referring to wrt "just a single object reference suffices to identify all of them" if they happen to be on the same page.

If you do as Note2 says, and you use a single OBJR for obj 5, then if you don't care about the instance of occurrence, it works because

in the "forward" lookup, when you go from structure to content, you'll always be able to find the entire content subblock corresponding to obj 5, and use that for rendering/replacement in structure traversal.
- in fact, it's fine even if you have multiple OBJRs provided they're under the same node, you just "render"/replace the same content multiple times
- I think overall though, it's still dangerous/potentially inconsistent, as you can have different graphics (or other?) states on the different do operators, and depending on what your traversal output is, it might matter. (I guess it falls under "not caring", but easy to confuse/overlook as a semantic REQUIREMENT)
in the "backward" lookup, from content to structure, you can encounter multiple instances of Fm5 Do, and if you follow the StructParent, you'll end up at a single structure element node, which is fine iff you have exactly 1 such node as parent of the OBJR "leaves"
So overall, Note2 is actually stronger than necessary, and does not clarify the actual reasons for the condition

I think this is a bastardization of the lookup process that coincidentally works, and we should not encourage it. I think it's just easier for everyone if there is a bijection between what I'll call "structure leaves" and (tagged) stream content. ("structure leaves" being instances in the "K" array corresponding to MCID, MCR, OBJR ways of referring to content)

I propose we get rid of Note2 altogether, and just say only 1 OBJR can exist for each pdf obj in the entire structure tree?

note that this is actually functionally more restrictive than before, but I think produces more clarity and consistency
even if we wanted to keep Note2, it should be amended (or clarified?) by saying that only a single structure element node must contain all OBJRs referring to this XObj (I guess it's implicit via the converse of "If it is important to distinguish between multiple renditions", but at least I was confused about what that meant)

Separately @sgaither, I feel like the example does not clearly illustrate the need to switch from OBJRs to MCID/MCRs, as the use case puts both instances under the same structure node, for which multiple OBJRs will coincidentally not fail. I suggest this example be kept to separately illustrate use of a single ActualText replacing multiple renderings, but a separate one be added where we put the different instances of FmX Do under different structure nodes. (potentially with comments indicating that "if you used OBJR here, you wouldn't be able to do the reverse lookup")

Also relating to your comment, I feel like

each invocation of Do individually associated with a structure element

does not clearly exclude the "association of a do to a structure element" via an OBJR, and people can be confused into thinking that they did things correctly if they have a separate OBJR for the same XObj under different structure nodes.

I propose we change the wording to just explicitly say something to the effect of "surround with an MC that has an MCID", or define & name the concept of "tagging" if we want to include all the baggage about properly setting up the ParentTree & StructureTree data as well (since you technically can (maybe?) just make a non-struct MC that has a property called MCID).

"Tagging" would have 2 modes,

one is StructParent singular + OBJR,
and the other is MCs w/ MCID, StructParents plural + MCID/MCR

Separately separately, I suggest that we clearly emphasize (and repeat in different notes as appropriate), that any subblock of content, once tagged, can not contain more tagging within it. With examples showing some selection/combination of:

<</MCID 1>> BDC /FmX Do EMC together with FmX being OBJR tagged (have StructParent singular) is illegal
FmX being OBJR tagged, and any transitive child having any MCID or OBJR tagging is illegal
Nesting marked content sequences (within the same stream) that have MCID is illegal
However, nesting of marked content sequences that DON'T have MCID with ones that do is legal
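The last two rules in that list lend themselves to a simple check. The Python sketch below flags a token stream in which a marked-content sequence with an MCID is opened while another sequence with an MCID is still open. It encodes the rule as proposed in this comment, which is an assumption and not a quotation of the spec; the token format (inline property dicts) and the name `has_nested_mcid` are likewise hypothetical.

```python
from typing import Any, List


def has_nested_mcid(tokens: List[Any]) -> bool:
    """True if an MCID-bearing sequence opens inside another MCID-bearing sequence."""
    stack: List[bool] = []  # for each open sequence: does it carry an MCID?
    prev: Any = None
    for tok in tokens:
        if tok == "BDC":
            carries_mcid = isinstance(prev, dict) and "MCID" in prev
            if carries_mcid and any(stack):
                return True
            stack.append(carries_mcid)
        elif tok == "BMC":
            stack.append(False)  # BMC has no property list, hence no MCID
        elif tok == "EMC":
            stack.pop()
        prev = tok
    return False


nested = ["/P", {"MCID": 0}, "BDC", "/Span", {"MCID": 1}, "BDC", "EMC", "EMC"]
mixed = ["/P", {"MCID": 0}, "BDC", "/Span", {"Lang": "en"}, "BDC", "EMC", "EMC"]
```

Here `nested` would be rejected, while `mixed` passes because its inner sequence has no MCID.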

@sgaither I believe this contradicts your statement

I believe that an inner Form XObject could potentially be contained by multiple outer Form XObjects, each specifying a unique parent in the structure tree, and thereby still allowing for the concept of reuse of content, but not violating the concept of a single parent

Why I believe it should is that, I think there's no real distinction between Fm Do and simply writing the stream contents there, and if we allow nested OBJRs, why can't we allow nested MCIDs? I think the purposeful flattening helps avoid potential issues & confusion with the reverse lookup, because we explicitly avoid the situation where you have to look up all of your parent MCs and their structure. And if you always go with the innermost MC, then there's a simple transform to flatten & satisfy the lack of nesting.

May 29 '24 06:05 myang-apryse

Related (but I'm not sure if it falls outside the scope of this ticket), what should be the appropriate understanding of non-"Do" methods of stream/content incorporation?

For example (and where it's relevant to this OBJR discussion), Annotations (usually Link), can be referred to using OBJRs, can we share annotation objects across pages?

WRT the "no tagged nesting" rule, I guess each annotation forms a separate "inclusion tree"? But what about things like TilingPatterns? Can/Should those count as XObj invocations in some sense? If we wish to support tagging within them, do we have to say something like if you SCN a tiling pattern multiple times, then that has the same restrictions as performing multiple "do" calls, and you can't use OBJRs on them?

What about things like branching AP streams? (i.e N, R, D, and annotation sub dictionaries for /On /Off) Should those be considered as non-interfering? That is, if you use the same XObj for multiple of those branches, in any particular state, you've only used it once, so multiple OBJR is allowed? (I guess that provides a use case that invalidates what I said about "1 OBJR can exist for each pdf obj in the entire structure tree", and we'll need a more complex description or go back to Note2)

May 29 '24 06:05 myang-apryse

Sorry for the long-winded comments, let me know and I'd be happy to edit/split/delete sections.

May 29 '24 08:05 myang-apryse

Well, I am not sure where this leaves us. First, the only thing I can be clear on is Robin's first comment that object 2 in the example should correctly point to object 1 as its parent (and I am updating the example to reflect that observation - thank you, Robin). Beyond that, I think that I am now perhaps confused on all of the above analysis and it is now unclear to me that any argument I give will convince people as to the intent of the specification (at least so far as I understand it, which now may potentially be not on as solid ground as I would like). I need time to re-read 14.7.5 in its entirety to better understand all of its implications. @petervwyatt, at this point I need to withdraw my original proposal, and this very long issue thread needs to remain unresolved.

May 29 '24 13:05 sgaither

@robinwatts (I apologize if my comments repeat some stuff, I might have misunderstood your level of understanding)

No apology necessary! I like to think I understand PDF well enough at this point to know how streams and Form XObjects work :)

I am a newcomer to using the structure tags/structure tree though.

MCID is a number that is stream local

D'Oh. Of course, I had forgotten that. As you say, that completely removes my worries about it being used from multiple pages, annotations etc.

My understanding is that when you do the text extraction, you actually reconstruct the extraction through a traversal from the structure tree side, looking up pieces of the stream content as required (i.e. you skip stuff that's not tagged, and the stuff that is tagged is not necessarily in the same order or same # of instances as in the streams).

That's certainly NOT how we do text extraction in either MuPDF or Ghostscript, and I'd be really surprised if the specification was written deliberately so as to mandate code working in this way.

My concerns here sprang from the discussion of this bug at the last TWG meeting. My understanding of @sgaither's claims was that EXAMPLE-6, even without the ActualText would result in 'clear emphasis' being included in the text extracted output just once.

If I have misunderstood that claim, then please excuse the noise.

If that is indeed the claim that @sgaither was making, then my concerns over how an interpreter is supposed to know to do that stand.

May 29 '24 15:05 robinwatts

@sgaither:

[cut] Upon more careful examination of the specification, the verbiage at the end of 14.7.5.2 is quite clear that: "A form XObject that is painted with multiple invocations of the Do operator shall be incorporated into the Document's logical structure only by the first method, with each invocation of Do individually associated with a structure element"

I'm sorry, but what you are referring to (i.e., marked-content sequences as content items (14.7.5.2)) is NOT what sub-issue 2 was about (i.e., object references to PDF objects as content items (14.7.5.3), as described in NOTE 2) — I am totally aware of the use of structural marked content for multiple invocations of the Do operator, but that's NOT the point on topic. For the sake of clarity, I bring here the problematic assertion from NOTE 2, as reported in my initial comment:

If the referenced object is rendered on multiple pages, each rendering requires a separate object reference.

The conundrum here is that NOTE 2 claims that separate object references to PDF objects (NOT marked-content sequences!) are required for each rendering in case of multiple pages, even though we all agree that a PDF object as content item can be associated, via structural parent tree, to ONE AND ONLY ONE parent structure element.
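A rough sketch of the conflict, assuming the reading in which each rendering's object reference sits under a different structure element. The names `Figure_p1`, `Figure_p2`, `Im1` and both helper functions are hypothetical and only model the bookkeeping, not an actual PDF parser.

```python
from collections import defaultdict

# Hypothetical model: each tuple is (structure element, referenced object).
# "Figure_p1" and "Figure_p2" stand for two structure elements that each hold
# one of the separate object references NOTE 2 asks for.
object_references = [
    ("Figure_p1", "Im1"),
    ("Figure_p2", "Im1"),
]


def parents_claiming(obj_refs):
    """Group structure elements by the object they reference."""
    claimed = defaultdict(set)
    for parent, obj in obj_refs:
        claimed[obj].add(parent)
    return claimed


def conflicts(obj_refs):
    """Objects that would need more than one value in the parent tree."""
    return {
        obj: sorted(parents)
        for obj, parents in parents_claiming(obj_refs).items()
        if len(parents) > 1
    }
```

Since `Im1` carries a single StructParent key and the parent tree entry for that key holds a single indirect reference, any object reported by `conflicts` cannot be mapped back to all of its claimed parents.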

I see two alternative solutions:

NOTE 2 is wrong: it should be amended to remove the incorrect assertion;
NOTE 2 is somehow correct (?): it should explicitly state the way separate object references on multiple pages are expected to be expressed, expanding EXAMPLE 1 in 14.7.5.4 (Finding structure elements from content items) to also address this case (i.e., multiple separate object references, one per page).

May 29 '24 16:05 stechio

@stechio My god you're right, I had mistakenly thought they were in the same section. What I was referring to as "method 2" is actually "method 3" (since 2 is taken).

... A form XObject that is painted with multiple invocations of the Do operator shall be incorporated into the Document's logical structure only by the first method...

However, I think the overall logic is still sound and this line does offer insight into how to resolve multiple XObj invocation, if you treat OBJR as surrounding the entire XObj's stream with an MC. (so maybe another potential change is to move this line to a common section, as it doesn't need to be exclusive to "Marked-content sequences as content items")

So taken all together, we get that if an XObj is reused,

we are not allowed to have tagging inside it (no method 2)
NEW: nor are we allowed to have it be tagged via OBJR (no method 3)
it can only be tagged via surrounding individual do operators with MC (method 1)
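As a sketch only, the three rules above could be checked like this. It assumes the method numbering used in this comment; `FormXObjectUse` and `reuse_violations` are hypothetical names, and the method 3 restriction is the proposal made here, not something the specification currently states.

```python
from dataclasses import dataclass


@dataclass
class FormXObjectUse:
    """Hypothetical summary of how one form XObject is painted and tagged."""
    do_invocations: int  # number of Do operators that paint it
    tagged_inside: bool  # method 2: marked content inside its own stream
    objr_count: int      # method 3: object references (OBJR) pointing at it


def reuse_violations(use: FormXObjectUse) -> list[str]:
    """List the rules a reused XObject would break under this reading."""
    if use.do_invocations <= 1:
        return []  # not reused, so none of the reuse rules apply
    problems = []
    if use.tagged_inside:
        problems.append("method 2 used on a reused XObject")
    if use.objr_count > 0:
        problems.append("method 3 used on a reused XObject (proposed restriction)")
    return problems
```

An XObject painted twice and tagged only by wrapping each Do operator in marked content (method 1) yields an empty list.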

My arguments above on the validity of method 3 + reuse stand, and I think it's an anomaly, and we should explicitly forbid it. (though its problem is less complex than the already forbidden method 2, which can have individual MCs point at different parents, and the full consistency condition is that each individual lookup ends up at the only parent)

Coming full circle, this is resolution 1 from my first comment

disallowing multiple OBJRs from referencing the same XObj

It turns out 2 is actually just method 1, with a misunderstanding on my part about how to "tag via the Do operator".

3 is not invalidated, but I feel it is an overly complex exception to take advantage of the singular nature of StructParent; if it were extended to method 2, we would require a 2D array for StructParent(s), plural.

May 29 '24 18:05 myang-apryse

@robinwatts My interpretation of the "semantically once using ActualText" comes from this quote

See Example 6 in this subclause for multiple invocation of a form XObject to overprint a phase in conjunction with the use of ActualText to indicate that the phrase is to be read only once ...

May 29 '24 19:05 myang-apryse

@robinwatts My interpretation of the "semantically once using ActualText" comes from this quote

See Example 6 in this subclause for multiple invocation of a form XObject to overprint a phase in conjunction with the use of ActualText to indicate that the phrase is to be read only once ...

Yes, the way the example is constructed, the ActualText will ensure that the text within the 2 invocations of the same XObject will only count once. I am not disagreeing with that. That seems perfectly fine and sensible.

My point of concern is with the claim that I believe was made during the meeting, which was that even without the ActualText, the contents of the XObject would only count once in the structured output, despite being invoked twice.

If I've misunderstood the claim, all that is required is for @sgaither to tell me so, and I'll cease barking up this particular tree, and retire with my apologies for wasting everyone's time.

If I haven't misunderstood the claim, then either 1) it's a reasonable claim and it's just my ignorance at play here, so I need to go away and try and understand how it can possibly work, OR 2) there is a genuine issue here.

May 29 '24 23:05 robinwatts
