
Table Annex L.2 includes a requirement unrelated to standard structure elements

Open petervwyatt opened this issue 2 years ago • 34 comments

Spawned from Errata #308 and specifically this comment: https://github.com/pdf-association/pdf-issues/issues/308#issuecomment-1717729592

I'd never noticed that "StructTreeRoot" in Table Annex L.2 - thanks for pointing that out!

It's also not correct as to how the table captions itself, as StructTreeRoot is not a "... standard structure elements in the standard structure namespace ...". It's a dictionary. But it also clearly says there is a "precisely 1" relationship - there is no zero so it always needs to be present.

I think this should be removed and phrased in English, something like the following - possibly in/around Table 364:

For the structure element dictionary (Table 355) that has a structure type of Document, the value of the P entry in that structure element dictionary shall be an indirect reference to the structure tree root dictionary (Table 354). There shall always be exactly one such structure element dictionary for PDF files containing structure hierarchy.

petervwyatt avatar Sep 16 '23 05:09 petervwyatt

Looks good to me.

car222222 avatar Sep 16 '23 07:09 car222222

@petervwyatt while I agree that StructTreeRoot is odd, just as "content item" is odd, I'm not sure we should fix this. I particularly don't think the text is correct, since a PDF can contain more than one Document tag, so the requirement that all Structure Elements of subtype Document have a parent of StructTreeRoot is wrong.

mrbhardy avatar Sep 20 '23 23:09 mrbhardy

OK - I understand now based on our TWG discussions that you can have multiple Document struct-elems in a PDF.

But I think there is a requirement for the one (and only one) that represents the entire PDF document to have the StructTreeRoot as its parent - is that a correct statement? So there must always be one Document struct-elem with StructTreeRoot as a parent...

petervwyatt avatar Sep 21 '23 01:09 petervwyatt

This is also my understanding.

DuffJohnson avatar Sep 21 '23 01:09 DuffJohnson

Proposed rewording:

For all PDF documents containing structure hierarchy there shall be exactly one structure element dictionary with a structure type of Document and where the value of the P entry in that structure element dictionary shall be an indirect reference to the structure tree root dictionary (Table 354). Other structure element dictionaries with a structure type of Document may also be present.
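A minimal sketch of how this wording could be checked, assuming a hypothetical in-memory model in which plain Python dicts stand in for PDF dictionaries and the `P` entry holds the parent object itself rather than an indirect reference. The helper names (`make_elem`, `iter_struct_elems`, `satisfies_proposed_wording`) are illustrative only and do not come from any PDF library.

```python
# Hypothetical model: dicts stand in for PDF dictionaries; "P" holds the
# parent object directly instead of an indirect reference.

def make_elem(struct_type, parent):
    """Create a structure element and attach it to its parent's K array."""
    elem = {"Type": "StructElem", "S": struct_type, "P": parent, "K": []}
    parent["K"].append(elem)
    return elem


def iter_struct_elems(node):
    """Yield every structure element below node, depth-first."""
    for kid in node.get("K", []):
        if isinstance(kid, dict) and kid.get("Type") == "StructElem":
            yield kid
            yield from iter_struct_elems(kid)


def root_level_documents(struct_tree_root):
    """Document elements whose P entry is the structure tree root itself."""
    return [
        elem
        for elem in iter_struct_elems(struct_tree_root)
        if elem["S"] == "Document" and elem["P"] is struct_tree_root
    ]


def satisfies_proposed_wording(struct_tree_root):
    """Exactly one Document has the root as parent; others may exist deeper."""
    return len(root_level_documents(struct_tree_root)) == 1


# The nesting below is chosen only for illustration.
root = {"Type": "StructTreeRoot", "K": []}
top = make_elem("Document", root)
part = make_elem("Part", top)
nested = make_elem("Document", part)  # a second Document, not under the root

print(satisfies_proposed_wording(root))
```

The nested Document does not affect the result because only elements whose `P` is the root are counted, which matches the final sentence of the proposed wording.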

petervwyatt avatar Sep 21 '23 03:09 petervwyatt

I’m not sure I see the need for this. I think we should keep StructTreeRoot in the tree and just allow the standard rules to ensure this relationship.

mrbhardy avatar Sep 21 '23 04:09 mrbhardy

In what tree? -- "keep StructTreeRoot in the tree"

car222222 avatar Sep 21 '23 05:09 car222222

Although the proposed rewording is now accurate, it could easily be read wrongly by misinterpreting the "and".

One way to avoid this possibility is to turn it around and state that:

the structure tree root dictionary must have exactly one entry in its /K key, which must be a structure element whose type is Document.

car222222 avatar Sep 21 '23 06:09 car222222

It is very important to remove any impression that the StructTreeRoot dictionary is itself either a structure element dictionary or, worse, a 'type of structure element'.

Such confusion has already caused problems!

car222222 avatar Sep 21 '23 06:09 car222222

I don't think that StructTreeRoot and the content item should be removed from the matrix. The root column in the matrix with its 1 is nice and clear and is something that can, at least theoretically, be parsed automatically by some validation code. I also never had a problem understanding that StructTreeRoot and content item are special elements in this matrix, but if some clarification is needed, I would suggest extending the caption to "relationship between root, standard structure elements and content items".
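To illustrate what "parsed automatically" might look like, here is a rough sketch. The cell notation it accepts (`1`, `0..1`, `0..n`, `1..n`, and `-` for "not permitted") is an assumption made for this example and is not claimed to be the exact notation of Annex L.2; only the `1` in the StructTreeRoot column for Document is taken from this thread.

```python
import math


def parse_occurrence(cell):
    """Turn an assumed matrix cell such as '1' or '0..n' into (low, high)."""
    cell = cell.strip()
    if cell == "-":  # assumed marker for "not permitted"
        return (0, 0)
    if ".." in cell:
        low, high = cell.split("..")
        return (int(low), math.inf if high == "n" else int(high))
    exact = int(cell)
    return (exact, exact)


def allowed(cell, count):
    """Check whether count children satisfy the cell's occurrence rule."""
    low, high = parse_occurrence(cell)
    return low <= count <= high


# Only this one cell is taken from the discussion; a real table would be
# loaded in full from the specification.
MATRIX = {("StructTreeRoot", "Document"): "1"}

cell = MATRIX[("StructTreeRoot", "Document")]
print(allowed(cell, 0), allowed(cell, 1), allowed(cell, 2))
```

Keeping the root as an ordinary key in the lookup table is what lets the same code path cover the "precisely 1" rule without a special case.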

Side remark: I miss math in the matrix.

u-fischer avatar Sep 21 '23 07:09 u-fischer

"math in the matrix"?? This sounds like a separate issue.

car222222 avatar Sep 21 '23 07:09 car222222

Such an extension to the caption might work well. But it is nevertheless vital to explain clearly (for all us others) that these two are not 'structure element types'.

car222222 avatar Sep 21 '23 07:09 car222222

Such an extension to the caption might work well. But it is nevertheless vital to explain clearly (for all us others) that these two are not 'structure element types'.

As long as they stay in the matrix I don't mind additional prose, but I would refer to the definitions in the relevant sections instead of repeating and rephrasing them. Duplicating text always carries the danger of producing some inconsistency. I only want the matrix to stay complete: it is about the parent-child relationship in the structure tree, and it would imho cripple it if one removed the root and the leaf nodes.

u-fischer avatar Sep 21 '23 08:09 u-fischer

I like the idea of changing the caption and possibly also footnoting the "StructTreeRoot" entry since the column header says "Structure Type" - that would make things clearer.

Annex L is also normative so there is unlikely to be wording elsewhere such as in clause 14 which duplicates anything.

petervwyatt avatar Sep 21 '23 09:09 petervwyatt

The 1st bullet in clause 14.8.4.1 and clause 14.8.4.3 are the only places I can find where the Document struct-elem type is discussed. They use permissive phrasing such as "may" or "some other cases" and don't state any related requirements - the use of "may" is at odds with the "precisely 1" that is normatively stated in Annex L.

petervwyatt avatar Sep 21 '23 09:09 petervwyatt

One more thing to note: this unique child of the structure tree root is somewhat distinct from any other nodes of type Document in the tree.

This distinction should be explained somewhere, maybe along with an explanation of what it means for a PDF document to contain elements that are themselves distinct documents (if indeed this is the intention).

I guess that the compulsory top element should have been given a distinct type from all these other elements of type Document -- too late now!

car222222 avatar Sep 21 '23 09:09 car222222

Proposed solution for discussion at next PDF TWG:

  • Annex L.2, in column "Structure Type": add a footnote symbol to "StructTreeRoot" stating informatively (i.e. as a Note) that the structure tree root (Table 354) is not a structure element type.

  • in 14.7.2, append to the end of the first para: "The structure tree root dictionary K entry shall contain a structure element whose type is Document."

    • This is a slight rewording of https://github.com/pdf-association/pdf-issues/issues/349#issuecomment-1728910607 which did not account for K being an array or a dictionary.
    • Alternatively could add to K cell in Table 354

petervwyatt avatar Oct 25 '23 03:10 petervwyatt

in 14.7.2, append to the end of the first para: "The structure tree root dictionary K entry shall contain a structure element whose type is Document."

Wait, what? No, that would be completely wrong. StructTreeRoot isn’t limited to just a document here. I think we’re confusing logical structure with tagged PDF.

StructTreeRoot can contain anything you want to put in there. Only in tagged PDF is it restricted.

mrbhardy avatar Oct 25 '23 04:10 mrbhardy

I’m not sure what we’re trying to fix here. Annex L is normative. It is what specifies that there’s only one Document in a Tagged PDF StructTreeRoot

mrbhardy avatar Oct 25 '23 04:10 mrbhardy

I'll add a little more context. Let's all remember that Logical Structure and Tagged PDF are not synonymous. First, a Logically Structured PDF is only a Tagged PDF if the Marked entry is set true in the MarkInfo entry in the document catalog dictionary. If this isn't set, StructTreeRoot can contain anything and is just using private logical structure.
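A minimal sketch of that first test, assuming a hypothetical model in which the document catalog is already available as a plain Python dict with PDF names as string keys (the function name `is_tagged_pdf` is illustrative, not a library call):

```python
def is_tagged_pdf(catalog):
    """True only if the catalog's MarkInfo dictionary has Marked set to true."""
    mark_info = catalog.get("MarkInfo")
    return isinstance(mark_info, dict) and mark_info.get("Marked") is True


# Logical structure is present in both cases; only the first is Tagged PDF.
tagged = {"StructTreeRoot": {"K": []}, "MarkInfo": {"Marked": True}}
structured_only = {"StructTreeRoot": {"K": []}}

print(is_tagged_pdf(tagged), is_tagged_pdf(structured_only))
```

A validator following this reasoning would apply the Annex L containment rules only when this check passes.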

Beyond that, Annex L explicitly limits its scope to simply the standard structure elements in the standard structure namespace for PDF 2.0. Since both that and the standard structure namespace for PDF 1.7 are permitted in a 2.0 document, the Document element is only restricted if the element is in the PDF 2.0 namespace. The StructTreeRoot is welcome to have any legal PDF 1.7 namespace elements (e.g. Art).

mrbhardy avatar Oct 25 '23 17:10 mrbhardy

I'd therefore recommend accepting your note in Annex L that the StructTreeRoot is not an element type, but otherwise leave this alone.

mrbhardy avatar Oct 25 '23 17:10 mrbhardy

Maybe add something such as this to the footnote in Annex L:

the entry is used only to record that the structure tree root (dictionary) must have exactly one direct child, of type Document.

car222222 avatar Oct 26 '23 15:10 car222222

Not directly related, but it would be useful if the introduction to Annex L contained a reference back to where (14.8.4.2) it is introduced and explained.

Maybe I missed this?

car222222 avatar Oct 26 '23 15:10 car222222

the entry is used only to record that the structure tree root (dictionary) must have exactly one direct child, of type Document.

Is this true, given that Table 354 allows either a dictionary or an array for K? Does that mean that the array format must have only one element, or that it can have multiple array entries so long as exactly one entry is a structure element of type Document, or is it that at least one entry is a structure element of type Document? I think it has to be exactly one element in the array to establish a properly rooted "tree" for the entire tagged PDF file - correct?? I don't think we want to prohibit the array format...
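The three readings can be made concrete with a short sketch. It assumes a hypothetical model where the structure tree root and its kids are plain Python dicts, and where K is either a single dict or a list; the function names are illustrative only.

```python
def root_kids(struct_tree_root):
    """Normalise K to a list, whether it is absent, a dictionary, or an array."""
    k = struct_tree_root.get("K")
    if k is None:
        return []
    if isinstance(k, list):
        return k
    return [k]


def is_document(elem):
    """True for a structure element dict whose S entry is Document."""
    return isinstance(elem, dict) and elem.get("S") == "Document"


def reading_single_element(root):
    """Reading 1: K holds exactly one kid, and it is a Document."""
    kids = root_kids(root)
    return len(kids) == 1 and is_document(kids[0])


def reading_exactly_one_document(root):
    """Reading 2: any number of kids, exactly one of them a Document."""
    return sum(is_document(kid) for kid in root_kids(root)) == 1


def reading_at_least_one_document(root):
    """Reading 3: any number of kids, at least one of them a Document."""
    return any(is_document(kid) for kid in root_kids(root))


dict_form = {"K": {"S": "Document"}}
array_form = {"K": [{"S": "Document"}, {"S": "Part"}]}

for root in (dict_form, array_form):
    print(
        reading_single_element(root),
        reading_exactly_one_document(root),
        reading_at_least_one_document(root),
    )
```

The dictionary form and a one-element array are indistinguishable after normalisation, so the first reading does not require prohibiting the array format.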


@mrbhardy - you're correct in that I proposed a fix in the wrong place as I only looked where the key is defined - apologies. But clause 14.8.4.3 Document level structure says nothing (normatively or informatively) about there needing to be precisely a single Document structure element in the structure tree root. At the very least, an informative note in the Document cell of Table 364, pointing to normative Annex L and noting that something relevant is defined there, would help. I think it is clear from this discussion and those in TWGs that not everyone appreciated this finer point and maybe what it means if attempting validation of the structure tree root object...

petervwyatt avatar Nov 14 '23 01:11 petervwyatt

I would agree with these two:

that there is no need to "prohibit the array format";

that, in this case, the array shall contain just this one element.

car222222 avatar Nov 14 '23 05:11 car222222

@mrbhardy wrote:

"The StructTreeRoot is welcome to have any legal PDF 1.7 namespace elements (e.g. Art)."

It would be interesting to know why the "mandatory unique Document element" was introduced in PDF 2.0.

Also, a more precise description of this freedom, and exactly when it applies, could usefully be added as a Note somewhere in the PDF 2.0 documentation.

car222222 avatar Nov 14 '23 05:11 car222222

Maybe ISO:32005 also needs some further explanation of exactly when the restrictions in its Table 5 apply -- especially since this table also contains the new "unique top Document element" rule, thus apparently (a priori) applying it also to documents containing "PDF 1.7 elements only".

Note that the introduction to Table 5 states:

PDF 1.7 elements and PDF 2.0 elements shall not have child or parent PDF 1.7 elements or PDF 2.0 elements that are not explicitly listed in Table 5.

which seems to imply that the table's restrictions apply to all documents, even those that are explicitly declared to be "PDF 1.7 only".

Therefore, it would also be useful to clarify very exactly what is meant by "use ... and ..." here:

This document specifies containment requirements for tagged PDF documents that use the PDF 1.7 namespace and the PDF 2.0 namespace.

for example, does "use" mean simply "contains an element that is from that namespace" or must the "use" of the namespace be declared?

does "and" mean that (for Table 5 to apply) the document must contain elements that are exclusive to both namespaces (i.e., at least one of each: in PDF1.7 only ; in 2.0 only) or must it contain just one of these two? or, at the other extreme, can it contain only elements that are in both? or, . . . many such interpretations are possible!

car222222 avatar Nov 14 '23 05:11 car222222

I think we should resolve this in tomorrow's UA meeting. However, some comments first.

Maybe ISO:32005 also needs some further explanation of exactly when the restrictions in its Table 5 apply -- especially since this table also contains the new "unique top Document element" rule, thus apparently (a priori) applying it also to documents containing "PDF 1.7 elements only".

Please read 5.2 and then you’ll see why the above makes no sense.

which seems to imply that the table's restrictions apply to all documents, even those that are explicitly declared to be "PDF 1.7 only".

Again, clearly answered by 5.2, though not sure what you mean by 1.7 only, since that’s an ambiguous statement. ISO 32005 only applies to PDF 2.0 documents.

does "and" mean that (for Table 5 to apply) the document must contain elements that are exclusive to both namespaces (i.e., at least one of each: in PDF1.7 only ; in 2.0 only) or must it contain just one of these two? or, at the other extreme, can it contain only elements that are in both? or, . . . many such interpretations are possible!

All are clearly answered in 32005 if you read the rest of the spec rather than solely looking at the mega table.

mrbhardy avatar Nov 14 '23 06:11 mrbhardy

Thanks, @mrbhardy. Yes, indeed, I should read Section 5 more thoroughly!

Just one remaining suggestion:

The Scope Section could perhaps itself state explicitly that the scope is limited to documents explicitly declared to be PDF 2.0, with a reference to Section 5.1?

car222222 avatar Nov 14 '23 07:11 car222222

Thus the second paragraph of Scope could usefully start as follows:

These containment requirements apply only to documents versioned as PDF 2.0 (see 5.1 for details) and they extend . . .

car222222 avatar Nov 14 '23 07:11 car222222