
Scope of MCID and location of StructParents is ambiguously defined

Open dhdaines opened this issue 7 months ago • 7 comments

As noted in section 7.7.3.3, Table 31, a page may contain multiple content streams, which are concatenated to form a single logical stream:

Contents: The value shall be either a single stream or an array of streams. If the value is an array, the effect shall be as if all of the streams in the array were concatenated with at least one white-space character added between the streams’ data, in order, to form a single stream [...] Applications that consume or produce PDF files need not preserve the existing structure of the Contents array.
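To make the "single logical stream" idea concrete, here is a minimal sketch of the concatenation rule quoted above. It assumes the streams have already been decoded to bytes; the function name `logical_content_stream` is made up for illustration and does not come from any PDF library.

```python
def logical_content_stream(contents):
    """Join decoded content stream data the way a Contents array is read.

    `contents` is either one bytes object (single stream) or a list of
    bytes objects (array of streams). Both shapes are assumptions of
    this sketch, not an API of any real library.
    """
    if isinstance(contents, (bytes, bytearray)):
        return bytes(contents)
    # The quoted rule asks for at least one white-space character
    # between the streams' data; a newline satisfies that.
    return b"\n".join(bytes(stream) for stream in contents)


print(logical_content_stream([b"/P <</MCID 0>> BDC", b"EMC"]))
```

Note that a marked-content sequence can open in one stream and close in the next, which is one reason the per-stream reading of "content stream" is awkward.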

In chapter 14, the term "content stream" is used repeatedly to define the scope of marked content sequences, e.g. Section 14.7.5.2:

The marked-content sequence shall contain a property list (see 14.6.2, "Property lists") containing an MCID entry, which shall be an integer marked-content identifier that uniquely identifies the marked-content sequence within its content stream

There is a problem here. Excluding the case of Form XObjects for the moment, can we assume that this scope is not an actual individual content stream, but rather the logical content stream as defined by the Contents array in a page?

This is not at all clear from the definition of structure content items, for instance, Section 14.7.5.4:

For a content stream containing marked-content sequences that are content items, the value shall be an array of indirect references to the sequences’ parent structure elements.

To locate the relevant parent tree entry, each object or content stream that is represented in the tree shall contain a special dictionary entry, StructParent or StructParents

This is implicitly clarified afterwards in a way that suggests that we didn't actually mean "content stream" but actually "page object", except in the case of Form XObjects:

Depending on the type of content item, this entry may appear in the page object of a page containing marked-content sequences, in the stream dictionary of a form or image XObject, or in an annotation dictionary.

Unfortunately there is still some ambiguity here, because if a page contains multiple content streams, or even if it has only one, these content streams could also have StructParents in their dictionary. What happens in that case? Are MCIDs scoped to the individual content streams or to the page? Should the stream's StructParents be used instead of the page dictionary? (I sincerely hope not...) (maybe this is Good, Actually, see below?)

One could imagine a slightly pathological but not illegal case like this PDF, where a Form XObject is also included as a content stream in a page:

pathological_streams.pdf

I would suggest clarifying Sections 14.7.5 and 14.7.6 to explicitly define the scope of MCIDs, namely:

  1. The single content stream of a Form XObject (note, appearance streams are Form XObjects)
  2. The logical content stream for a page as defined by its Contents property (see Section 7.7.3.3, Table 31)

And (more) explicitly define the proper locations and semantics of StructParents:

  1. In the stream dictionary of a Form XObject
  2. In the page dictionary, for a page
  3. If StructParents is present in the stream dictionary of a content stream named in a page's Contents property, it is ignored (maybe not, see below?)
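As a rough sketch of what the proposed rules would mean for a processor, the lookup below resolves the structure parent of a marked-content sequence from its owner (a page or a Form XObject) only. Plain dicts stand in for PDF objects, the parent tree is modelled as a dict from integer key to an array indexed by MCID, and `parent_of` is a hypothetical name; none of this is a real library API.

```python
def parent_of(owner, mcid, parent_tree):
    """Find the structure parent for `mcid` in a page or Form XObject.

    `owner` is a dict standing in for the page dictionary or the form's
    stream dictionary. Any StructParents found on the individual streams
    listed in a page's Contents is never consulted, following rule 3 of
    the proposal above.
    """
    key = owner.get("StructParents")
    if key is None:
        return None
    parents = parent_tree.get(key, [])
    # The MCID is used as a zero-based index into the array.
    if 0 <= mcid < len(parents):
        return parents[mcid]
    return None


page = {
    "Type": "Page",
    "StructParents": 0,
    # A stray StructParents on a content stream, ignored by this sketch.
    "Contents": [{"StructParents": 7}],
}
parent_tree = {0: ["P1", "P2"], 7: ["X"]}
print(parent_of(page, 1, parent_tree))
```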

Finally the definition of the Stm property of marked-content reference dictionaries (Table 357) should just be slightly amended to clarify what "content stream for the page" means - I like the term "logical content stream" but perhaps there's a better one.

dhdaines, May 31 '25 02:05

Unfortunately there is still some ambiguity here, because if a page contains multiple content streams, or even if it has only one, these content streams could also have StructParents in their dictionary. What happens in that case? Are MCIDs scoped to the individual content streams or to the page? Should the stream's StructParents be used instead of the page dictionary? (I sincerely hope not...)

After some reflection, much like clouds, we have to look at MCID and StructParents from both sides now:

  1. When extracting structured text from a tagged PDF, we traverse the logical structure tree, and we wish to assign text and images to an element based on its K array.
  2. When parsing or rendering a PDF, we traverse a marked content section in a content stream, and we wish to find its logical structure parent, using the StructParents array.

Fundamentally, the relation between elements of the K list and marked content sections must be bijective, but the way the standard is written this isn't guaranteed.
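One way to picture the bijectivity requirement is as a check over two lists of `(owner, mcid)` pairs: the pairs referenced from K entries, and the pairs actually found in marked-content sections. This is only a sketch under that assumed data model; the function name and the problem labels are invented here.

```python
from collections import Counter


def bijection_problems(k_refs, marked_sections):
    """Return a sorted list of (problem, (owner, mcid)) tuples.

    `k_refs` lists the (owner, mcid) pairs referenced from K entries;
    `marked_sections` lists the pairs found while parsing content.
    An empty result means the two sides match one to one.
    """
    ref_counts = Counter(k_refs)
    section_counts = Counter(marked_sections)
    problems = []
    for key in sorted(set(ref_counts) | set(section_counts)):
        refs, sections = ref_counts[key], section_counts[key]
        if refs > 1:
            problems.append(("referenced more than once", key))
        if sections > 1:
            problems.append(("MCID reused within owner", key))
        if refs == 0:
            problems.append(("marked content has no structure element", key))
        if sections == 0:
            problems.append(("structure element points at nothing", key))
    return problems


k_refs = [("page3", 0), ("page3", 1), ("page3", 1), ("form9", 0)]
marked = [("page3", 0), ("page3", 1), ("page3", 2)]
for problem in bijection_problems(k_refs, marked):
    print(problem)
```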

For case (1) this isn't a huge problem, because we can always use marked-content reference dictionaries to refer to MCIDs in any page, any content stream, anywhere, and this is well described by the standard in Table 357, except that the language preceding this table should be made stronger, and clarified as noted previously, to restrict the case where we don't use marked-content reference dictionaries:

An integer that specifies the marked-content identifier. This may only be done in the common case where the marked-content sequence is contained in the content stream or streams listed in the Contents entry of the page dictionary that is specified in the Pg entry of the structure element dictionary and the marked-content identifiers are unique across all streams for this page.

Case (2) is the ambiguity described above. In the end, it's okay for stream objects to define their own StructParents, and even for multiple streams on the same page to use the same MCIDs. I don't know if this happens in real-world PDFs, but since the existing standard appears to allow it, I guess it should continue to be accepted, as long as the condition above on K lists is respected.

In addition, the restrictions on Form XObjects containing marked content sections in section 14.7.5.2 (page 732, before example 4) should be generalized to all content streams, so, instead of or in addition to:

A form XObject that is painted with multiple invocations of the Do operator shall be incorporated into the document’s logical structure only by the first method, with each invocation of Do individually associated with a structure element.

The standard should say something like:

A content stream containing marked content sequences associated with logical structure elements shall never be rendered more than once in the document, either as an element of a Contents entry of a page or as a Form XObject painted with the Do operator.

dhdaines, Jun 03 '25 02:06

Finally, what should happen in the case where both a page object and one of its content streams contain a StructParents entry? Again, uncertain if this happens in real-world PDFs, and it's kind of a silly thing to do, but the existing standard appears to allow it.

Perhaps the StructParents in the content stream should be used if it exists, otherwise the one in the page, and if they both exist, then the overlapping portion is required to match? I can imagine this happening in the case of incremental updates, e.g.:

  • Page 3 exists, with a single content stream, which defines its own StructParents for MCIDs 0 through 5
  • An incremental update is made, creating a new page object for Page 3 and appending another content stream to its Contents which contains more marked content.
  • This time, StructParents is created in this page object (but not in the content stream), and the second content stream contains MCIDs 6 through 10.
  • So, the new StructParents should be required to duplicate the references to MCIDs 0 through 5 from the original content stream.

This actually seems like a good argument for allowing StructParents on individual content streams, since it would create a smaller output document and make the implementation easier - the second content stream can simply define its own MCIDs 0 through 5. There will just need to be a warning for PDF processors which "repair" PDFs that they must remember to renumber MCIDs and merge StructParents if they combine content streams for a page.
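For the "repair" case, the bookkeeping might look roughly like this: each stream's own parent array is appended to a merged array, and its MCIDs are shifted by the number of entries already merged. This is a sketch of the renumbering only, with per-stream parent arrays as plain lists; rewriting the MCID values inside the content stream data is not shown, and `merge_struct_parents` is a hypothetical helper.

```python
def merge_struct_parents(per_stream_parents):
    """Merge per-stream parent arrays into one page-level array.

    `per_stream_parents` holds one list per content stream, indexed by
    that stream's own MCIDs. Returns the merged array plus, for each
    stream, a mapping from old MCID to new MCID.
    """
    merged = []
    renumber = []
    for parents in per_stream_parents:
        offset = len(merged)
        renumber.append({old: old + offset for old in range(len(parents))})
        merged.extend(parents)
    return merged, renumber


# Two streams on one page, each numbering its MCIDs from 0.
merged, renumber = merge_struct_parents([["H1", "P"], ["P", "Figure", "P"]])
print(merged)
print(renumber[1])
```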

dhdaines, Jun 03 '25 03:06

I think the heart of this issue is confusion over whether a page that has /Contents as an array has a single Content Stream or multiple Content Streams. It's a fair question and a more general problem than simply the text in 14.7.6.

I've opened https://github.com/pdf-association/pdf-issues/issues/561 for that issue.

I've put some recommendations in there and I hope you can figure out the intent from the proposal - essentially, in most of 14.7.6 where it states "content stream" it really means "the page or xobject"*. If that definition is used I think most of the questions in this issue go away?

(*) because while patterns and Type3 glyphs etc may have MCIDs in their streams they can never be part of the structure tree, and while annotations can BE content in the structure tree, but cannot CONTAIN content. So the issue of MCIDs within a stream effectively only applies to pages and xobjects

faceless2, Jun 03 '25 10:06

I've put some recommendations in there and I hope you can figure out the intent from the proposal - essentially, in most of 14.7.6 where it states "content stream" it really means "the page or xobject"*. If that definition is used I think most of the questions in this issue go away?

Thanks! Yes, in that case, then, we would define StructParents as being a property of the "content stream owner", meaning a page or an xobject, and it would be forbidden on individual stream objects in a page's Contents array (and ignored if found there), which was my original proposal.

In this case we should still make it explicit that MCIDs are scoped to the content stream owner, thus, must be unique across all stream objects in a page. And I think it's also necessary to insist that a content stream object containing marked content should not be rendered multiple times in a document, just in case someone thinks it would be fun to include a Form XObject as an element of Contents, or more plausibly, use the same stream object as the contents for multiple pages.
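A checker for the "not rendered multiple times" condition could start from something as simple as the sketch below, which only covers streams listed in pages' Contents (not Form XObjects painted with Do). Pages are modelled as a dict from a page label to a list of stream object numbers; both the model and the name `reused_streams` are assumptions made for illustration.

```python
from collections import Counter


def reused_streams(pages):
    """Return the stream object numbers that appear more than once.

    `pages` maps a page label to the list of stream object numbers in
    that page's Contents. A stream repeated within one page counts too.
    """
    uses = Counter(
        stream_id for streams in pages.values() for stream_id in streams
    )
    return sorted(stream_id for stream_id, count in uses.items() if count > 1)


pages = {"page1": [10, 11], "page2": [11, 12], "page3": [13, 13]}
print(reused_streams(pages))
```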

dhdaines, Jun 03 '25 12:06

should still make it explicit that MCIDs are scoped to the content stream owner thus must be unique across all stream objects in a page

We've got the text "The marked-content sequence shall contain a property list (see 14.6.2, "Property lists") containing an MCID entry, which shall be an integer marked-content identifier that uniquely identifies the marked-content sequence within its content stream" in 14.7.5.2, so I'm hoping that "uniquely identifies" covers that.

And I think it's also necessary to insist that a content stream object containing marked content should not be rendered multiple times in a document.

This has come up before, see https://github.com/pdf-association/pdf-issues/issues/343 and https://github.com/pdf-association/pdf-issues/issues/544. Every "content stream owner" that has structural items must have a StructParent or StructParents which refers to an index in the ParentTree; the ParentTree maps those indices to structure elements or object references; and each of those elements has only one parent, a fact which applies transitively up to the root of the tree. The end result is that each content stream containing marked content must be in only one place in the tree - you can't build the tree any other way without breaking one of the other rules.

However this only becomes apparent once all those dots are joined. I recall finding this non-obvious when I was working through it, and the fact it keeps coming up probably means it should be more explicit. Perhaps an explanatory note near the top of 14.7 somewhere.

faceless2, Jun 03 '25 13:06

The standard should say something like:

A content stream containing marked content sequences associated with logical structure elements shall never be rendered more than once in the document, either as an element of a Contents entry of a page or as a Form XObject painted with the Do operator.

I support this as a concise requirement summarizing the implicit (current) restrictions required for consistency.

We might want to say "including but not limited to" for the examples list, or just remove it and leave it to the reader to semantically understand "rendered more than once".

Because a bunch of different things can have streams, and those are rendered, sometimes optionally, via different inclusion methods, such as Annotation appearance streams, TilingPatterns, etc. And technically speaking, there's nothing preventing you from having StructParent(s) on them.

That said, I'd also be fine if we wished to more strictly restrict the use of tagging, such that some of those are explicitly disallowed via white/blacklisting. E.g. why would you tag inside a tiling pattern?

Additionally, it might be good to give people suggestions on what to do if they do have a multiple-render use case, e.g. "you must duplicate each such XObject", though I'm not sure if that's the job of the PDF standard.

myang-apryse, Jun 05 '25 19:06

while annotations can BE content in the structure tree, but cannot CONTAIN content.

@faceless2 Can you explain this a bit more? What about appearance streams?

myang-apryse, Jun 05 '25 19:06