"Content stream" for a page is ambiguously defined.
There is ambiguity over whether a page that has an array with multiple items for /Contents has a single content stream or multiple content streams - i.e. is each single stream in the array a "content stream", or is there a single logical "content stream" which is the concatenation of the contents of the array?
Most of ISO32000 assumes that "content stream" is a single stream formed from the concatenation of the parts; that's the way https://github.com/pdf-association/pdf-issues/issues/9 was resolved, and this usage is fairly consistent throughout the rest of ISO32000. However there are exceptions: specifically section 7.8.2 implies that each item in the array is a stream in itself with the "one or more" in this text:
Each page of a document shall be represented by one or more content streams.
and in section 3.12 we have a definition of "content stream" as "a stream object..." ("content stream" is later re-defined more fully in 7.8.2)
This is not just academic; it's causing confusion, see https://github.com/pdf-association/pdf-issues/issues/557
If the definition of "content stream" is the logical concatenation of any parts, then several areas of ISO32000 need changes. My suggestions:
- Strike paragraph 3.12 completely, and replace any references to it with references to 7.8.2.
- We claim to define "content stream" in section 7.8.2, but don't mention that for a page it can be broken into multiple sections; that's only noted in table 31. I suggest we move that text from table 31 to section 7.8.2, so it can't be missed. Table 31 could read:
Contents (stream or array) (Optional) A content stream (see 7.8.2, "Content streams") that shall describe the contents of this page. If this entry is absent, the page shall be empty.
And section 7.8.2 could change from:
Each page of a document shall be represented by one or more content streams.
to:
Each page of a document shall be represented by a single content stream, specified as a single stream or an array of streams... remainder of text from Table 31 goes here
- Table 354 definition of "ParentTree" includes the text "For a page object or content stream containing marked-content sequences that are content items...", which implies it could apply to the page or the page's content stream separately. It could be reworded to "For a page or other object with a content stream containing marked-content sequences...". Alternatively, if "content stream owner" is defined, as in the next suggestion, we could use that term.
- Section 14.7.5.4 is the worst offender - throughout this section "content stream" is often used to mean an item with a content stream, such as a page or xobject. It might be useful here to define the term "content stream owner" as an item with a content stream - a page or xobject - and then use that where "content stream" is currently used incorrectly, which is:
  - first paragraph on p735 (replace "each content stream containing" with "each content stream owner containing")
  - paragraph preceding table 359 (replace "each object or content stream" with "each content stream owner")
  - table 359 (replace "Required for all content streams" with "Required for all content stream owners")
  - only paragraph on p736 (replace "the page object or other content stream" with "the content stream owner", and "within that content stream" with "within that object's content stream")
- Trap Network annotations: Table 403 refers to "All content streams identified in the page object's Contents". This either needs changing to refer to "the page's content stream", or if the individual stream components really do need to be referenced separately then that needs to be made explicit, and the term "page's content stream" should not be used. I have never seen one of these and I very much doubt I ever will, so I'm not sure what's intended here.
- Finally, we have a section (two sections!) defining "content stream", yet throughout the text it's uncapitalised as "content stream". I'm venturing into ISO territory I definitely don't understand here, but to me it feels like if we define it, it should be "Content Stream" throughout - particularly as a "Content Stream" may actually be stored as an array containing a sequence of streams.
In Ye Olde Days, when content streams could end in the middle of a lexeme and q/Q pairs did not have to be matched, it was clear that stream objects which are part of a /Contents array could not be processed individually - not even parsed! The array had to be considered as a whole, or at least checked through for irregularities.
ISO PDF has banned unbalanced q/Q pairs, and insists streams end on lexical boundaries. But I still think you could construct a pair of streams where you can't parse the first without parsing the second - something to do with a split inline image probably.
So I don't see there's any choice but to define "content stream" to mean "concatenation of all elements of /Contents with a space in between". A single /Contents entry is really not a processable object absent its context as part of the page's whole Content Stream, so it shouldn't have a name, and all references to it are suspect.
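To make the "concatenation" reading concrete, here is a rough sketch (mine, not anything from the spec). It assumes the stream data of each /Contents element has already been decoded to bytes; the helper names `logical_content_stream` and `q_balance` are hypothetical, and the q/Q counter is a naive whitespace tokenizer that ignores strings, comments and inline images.

```python
def logical_content_stream(parts):
    """Join decoded /Contents parts into one logical content stream.

    `parts` is a list of bytes objects, one per array element; a page whose
    /Contents is a single stream is just a one-element list.
    """
    # A whitespace byte between parts stops tokens fusing across a boundary.
    return b"\n".join(parts)


def q_balance(data):
    """Count q operators minus Q operators (naive: splits on whitespace only)."""
    tokens = data.split()
    return tokens.count(b"q") - tokens.count(b"Q")


# Hypothetical page whose /Contents array has two elements.
parts = [b"q 1 0 0 1 72 720 cm", b"0 0 100 100 re f Q"]
whole = logical_content_stream(parts)

print(q_balance(parts[0]), q_balance(parts[1]), q_balance(whole))
```

Neither part is balanced on its own; only the joined stream is, which is the sense in which an individual array element is not processable out of context.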
Thanks for this! There's a few other places this ambiguity strikes as well, notably to do with graphics and text state, e.g. 9.3.1, which probably refers to the concatenation of streams in a page (or to a single xobject):
The text state operators may appear outside text objects, and the values they set are retained across text objects in a single content stream
Searching for "single content stream" also reveals this in 14.6.1, where it's really not clear (perhaps the intent is to say that marked content sections can't be split between pages or xobjects? but it already says that elsewhere...):
Marked-content sequences may be nested one within another, but each sequence shall be entirely contained within a single content stream.
The text state operators may appear outside text objects, and the values they set are retained across text objects in a single content stream
That's one of the sentences I didn't quote because they supported the "content streams are the sum of their parts" approach! If you had a content stream divided into two physical parts (two streams), and you set the font in the first part, it remains set in the second part. Any other interpretation would be nonsensical. Most references to content streams are compatible with this world view; where that's the case I haven't noted them. Of course I may have missed some.
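A toy illustration of the font example, purely as a sketch: `fonts_at_show` is a hypothetical helper that tracks only the Tf operator and splits on whitespace, so it would not cope with strings containing spaces or with real-world content.

```python
def fonts_at_show(data):
    """Return the font name in effect at each Tj operator (toy tokenizer)."""
    font = None
    operands = []
    seen = []
    for tok in data.split():
        if tok == b"Tf":
            font = operands[-2]  # Tf operands are: /Name size
            operands = []
        elif tok == b"Tj":
            seen.append(font)
            operands = []
        elif tok in (b"BT", b"ET"):
            operands = []
        else:
            operands.append(tok)
    return seen


# Hypothetical two-part /Contents array: font set in part 1, text shown in part 2.
part1 = b"/F1 12 Tf"
part2 = b"BT (Hello) Tj ET"

print(fonts_at_show(part1 + b"\n" + part2))  # treated as one logical stream
print(fonts_at_show(part2))                  # part 2 treated in isolation
```

Read as one logical stream, the Tj sees /F1; read in isolation, the second part has no font set at all.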
Re marked-content sequences, I read this as "you can't start a sequence in a Page and end it in an XObject referenced by the page" (for example). I agree it's a slightly redundant statement.
Linking to various related errata: #9, #201, #208, #363
I suspect we need a new special term for the partial stream thing that the page Contents array form supports. This will make wording much easier - esp. if we can keep the current "content stream" phrase reserved for the holistic combination and non-array Contents (I say this pragmatically since that will mean far fewer edits for me to patch in as errata into the current text).
Building on the opening para of 7.8.2 Content streams:
A content stream is a PDF stream object whose data consists of a sequence of instructions describing the graphical elements to be painted on a page. The instructions shall be represented in the form of PDF objects, using the same object syntax as in the rest of the PDF file. However, whereas the file as a whole is a static, random-access data structure, the objects in the content stream shall be interpreted and acted upon sequentially.
how about "instruction stream"? So a page "content stream" (in array form) is a concatenation of "instruction streams" and when there is just one "instruction stream" it SHALL be a "content stream". The phrase "Instruction streams" alone does not imply anything about where breaks between "instruction streams" can occur - we obviously need to specify this separately as SHALL statements.
how about "instruction stream"?
I don't have a better proposal. But to me it is not intuitively clear which of "content stream" and "instruction stream" describes the actual COS stream object and which describes the abstraction.
"Content instruction stream" (CIstream?) may provide the desired distinction, as it builds on "content stream" rather than (potentially) appearing competitive?
#kibitzing
"content stream element"?
"Each page of a document shall be represented by one or more content streams" would have to be changed to: "Each page of a document shall be represented by one content stream which may consist of several content stream elements".
I was trying to avoid phrases such as "X content stream" or "content stream X" since such phrasing might be ambiguous - a completely different phrase was 'safer' in my mind...
content sub-stream is the best I can come up with I think.
Well, an option would be something like "individual content stream object" if that wasn't such a long term.
A content stream is an individual content stream object or an array of individual content stream objects, ...
Pretty long-winded...
Flagging as "proposed solution" for an open discussion at next PDF TWG virtual meeting.