Possible clarification on requirement for Cross Reference Streams including themselves.
In a recent email thread, I raised the question of whether the 'Size' given in a cross reference stream had to include the cross reference stream item itself, especially because such cross reference stream objects (in common with linearisation objects) are not actually linked into the DOM. There are many files out there (produced by a popular tool) that do NOT follow this.
In the course of this thread, the question was further raised as to whether a cross reference object should specifically include itself.
This was discussed briefly at the Feb 13th TWG meeting (in passing as issue #522 was resolved), and I was asked to open an issue for further consideration.
One participant pointed us to the language at the end of 7.5.8.3 which appears to cover most of this ground:
"Like any stream, a cross-reference stream shall be an indirect object. Therefore, an entry for it shall exist in either a cross-reference stream (usually itself) or in a cross-reference table (in hybrid-reference files; see 7.5.8.4, "Compatibility with applications that do not support compressed reference streams")."
Personally, on reflection, I feel that's quite clear. I have nonetheless opened this issue as requested, in case anyone feels differently, or has ideas for improving the language.
I also think that the paragraph at the end of 7.5.8.3 clearly expresses that a cross reference stream must be referenced from the document cross references, and that it may (but doesn't have to) itself contain that entry.
Furthermore, it illuminates what the authors had in mind when specifying the cross reference stuff: for every indirect object in a PDF an entry in some cross reference table or stream must exist. It doesn't matter whether or not there is an indirect reference for it anywhere.
PDF processors may read PDFs in a more relaxed manner but they should respect this when writing a PDF.
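As a rough illustration of that reading (every indirect object, including the cross-reference stream itself, needs an entry somewhere), here is a minimal Python sketch. The function names and the sample object numbers are hypothetical and not taken from any real PDF library; the Size rule follows the usual "one greater than the highest object number" definition.

```python
def uncovered_objects(body_object_numbers, xref_entry_numbers):
    """Return object numbers that appear in the file body but have no xref entry."""
    return sorted(set(body_object_numbers) - set(xref_entry_numbers))


def minimum_size(body_object_numbers):
    """Size is taken here as one greater than the highest object number in the file."""
    return max(body_object_numbers) + 1


# Hypothetical file: objects 1-4 are ordinary objects, object 5 is the
# cross-reference stream itself (not referenced from anywhere in the document).
body_objects = [1, 2, 3, 4, 5]

# Entry 0 is the head of the free list in both variants.
xref_without_self = {0, 1, 2, 3, 4}
xref_with_self = {0, 1, 2, 3, 4, 5}

missing_when_omitted = uncovered_objects(body_objects, xref_without_self)
missing_when_included = uncovered_objects(body_objects, xref_with_self)
required_size = minimum_size(body_objects)
```

Under this reading, a writer that leaves the stream's own entry out (and sets Size to 5) produces a file with one uncovered indirect object, even though nothing in the document refers to it.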
Somewhat related: in a "hybrid-reference PDF" the conventional trailer dictionary will have an XRefStm entry. If this hybrid-reference file is then incrementally updated, the current wording in 7.5.6 requires the new trailer to include the same XRefStm entry as well!
... The added trailer shall contain all the entries except the Prev entry (if present) from the previous trailer, whether modified or not. In addition, the added trailer dictionary shall contain a Prev entry giving the location of the previous cross-reference section (see "Table 15 — Entries in the file trailer dictionary"). Each trailer shall be terminated by its own end-of-file (%%EOF) marker. ...
This is wrong for XRefStm. There is no requirement that incremental updates of a hybrid-reference file need to retain the hybrid-ness (since there is no way to know if the software doing the update is PDF 1.5 aware or not). The incremental update could use either a conventional cross-reference section (which may or may not have known about the XRefStm objects!), or a cross-reference stream (which would definitely know about the XRefStm objects), or (if it was very tricky) a hybrid-reference incremental update. If the old XRefStm is blindly copied into the trailer of the incremental update then some objects could get the "older" definitions when used with PDF 1.5 processors since the order of processing is mandated. Thus the trailer entries in a cross-reference stream dictionary should also never include an XRefStm entry.
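The stale-definition risk can be sketched with a small Python model. It assumes the lookup order that 7.5.8.4 describes for PDF 1.5 processors (search the section's own table, then the stream named by its XRefStm, then follow Prev). The XRefSection class, the object numbers, and the "v0"/"v1" labels are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class XRefSection:
    """One conventional cross-reference section plus its trailer (hypothetical model)."""
    table: dict                        # object number -> label of the definition
    xref_stm: Optional[dict] = None    # entries of the stream named by XRefStm, if any
    prev: Optional["XRefSection"] = None


def resolve(section, obj_num):
    """Look an object up the way a PDF 1.5 aware reader is assumed to:
    table first, then the XRefStm stream, then the previous section."""
    while section is not None:
        if obj_num in section.table:
            return section.table[obj_num]
        if section.xref_stm is not None and obj_num in section.xref_stm:
            return section.xref_stm[obj_num]
        section = section.prev
    return None


# Original hybrid-reference file: object 10 is only visible via the XRefStm stream.
stream0 = {10: "v0"}
original = XRefSection(table={1: "catalog"}, xref_stm=stream0)

# First update: a conventional table that redefines object 10, with no XRefStm.
update1 = XRefSection(table={10: "v1"}, prev=original)

# Second update, variant A: the old XRefStm entry is blindly copied forward.
update2_copied = XRefSection(table={2: "page"}, xref_stm=stream0, prev=update1)

# Second update, variant B: the XRefStm entry is dropped.
update2_clean = XRefSection(table={2: "page"}, prev=update1)

stale = resolve(update2_copied, 10)
fresh = resolve(update2_clean, 10)
```

In variant A the copied XRefStm is consulted before Prev, so the lookup stops at the original stream and returns the older definition; in variant B the lookup reaches the first update and returns the newer one.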
Back on topic...
My (minor) concern with the current wording:
Like any stream, a cross-reference stream shall be an indirect object. Therefore, an entry for it shall exist in either a cross-reference stream (usually itself) or in a cross-reference table (in hybrid-reference files; see 7.5.8.4, "Compatibility with applications that do not support compressed reference streams").
is that the parenthetical implies that the cross-reference table is only(!) applicable to hybrid-reference files, as there is no "... (for example, ...)" or "... (including ...)". This ignores the possibility of incremental updates using cross-reference streams.
It would be better to finish the sentence with a full stop after "... in a cross-reference table." to make it unconditionally true all the time. Then replace the parenthetical with a new sentence "This includes both for incremental updates (see 7.5.6, "Incremental updates") and hybrid-reference files (see 7.5.8.4, "Compatibility with applications that do not support compressed reference streams")."
The original wording makes sense if you assume that PDF files with cross reference tables (with or without XRefStm) can only be incrementally updated using cross reference tables, and that PDF files with pure cross reference streams can only be incrementally updated using pure cross reference streams.
Such a limitation is not explicitly in 32K. Interestingly, though, Leonard (or I think it was him, considering the user name) once said this limitation does exist:
Yes, you can NOT "cross the streams" (to quote the classic movie phrase).
If the original PDF uses classic xrefs, you need to use the same at append time. If it uses streams, you need to use streams.
(Comment to "Adding XRef table to PDF w/ XRef streams?" on the old Adobe forum in 2012; the link uses the wayback machine)
It may be possible to derive this from the sense and purpose of hybrid streams presented in section 7.5.8.4.
If we agree, though, that this limitation does not exist, you're completely right and we should adapt the wording here.
Unfortunately extant data doesn't support that position... there are PDFs out there like that. But I do agree as an industry we need to come to an agreed position.
Admittedly, "hybrid-reference" PDFs don't really need to be created anymore since the majority of PDF software now supports PDF 1.5 and can correctly handle (pure) cross-reference streams -- but, again, I also know this NOT to be the case!
Let's discuss in a future PDF TWG and see what the consensus is (temporarily labeling as "proposed solution" for this reason only).
Unfortunately extant data doesn't support that position... there are PDFs out there like that.
Do you by chance have an example at hand?
If I recall correctly, at the time of that forum response by lrosenth a well-known implementation indeed did not accept mixed xref PDFs. So it might be interesting to check support for them in current PDF processors.
That is correct - it's like Ghostbusters...don't mix/cross the streams!
Summary of agreements from discussions at Concord PDF TWG Meeting: For incremental updates (covering both the original file + a first incremental update and multiple incremental updates):
- shall never "downgrade" from cross-reference streams to conventional cross-reference tables
- should not "upgrade" conventional cross-reference tables to cross-reference streams
Note that hybrid-reference PDFs have both types of cross-references, so we need to be careful with wording not to invalidate this (even if we have a desire to deprecate this - see Errata #115!)
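The two agreed rules could be checked along the following lines. This is only a sketch: the XRefKind enum, the severity labels, and in particular the decision to treat a hybrid-reference section as table-based are assumptions made for illustration, since the wording for hybrid files is still open.

```python
from enum import Enum


class XRefKind(Enum):
    TABLE = "table"      # conventional cross-reference table
    HYBRID = "hybrid"    # conventional table whose trailer has an XRefStm entry
    STREAM = "stream"    # pure cross-reference stream


def is_table_based(kind):
    # Assumption: a hybrid-reference section counts as table-based here.
    return kind in (XRefKind.TABLE, XRefKind.HYBRID)


def check_update_chain(kinds):
    """kinds lists the cross-reference kind of the original file followed by
    each incremental update, oldest first. Returns (index, severity) pairs."""
    findings = []
    for i in range(1, len(kinds)):
        before, after = kinds[i - 1], kinds[i]
        if not is_table_based(before) and is_table_based(after):
            findings.append((i, "error"))      # "shall never downgrade"
        elif is_table_based(before) and not is_table_based(after):
            findings.append((i, "warning"))    # "should not upgrade"
    return findings


downgrade = check_update_chain([XRefKind.STREAM, XRefKind.TABLE])
upgrade = check_update_chain([XRefKind.TABLE, XRefKind.TABLE, XRefKind.STREAM])
consistent = check_update_chain([XRefKind.HYBRID, XRefKind.TABLE])
```

Each pair is checked against its immediate predecessor, so the same function covers both a single incremental update and a chain of several.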
To-Do: work out some proposed new wording and where to insert new words in 32K-2.