pdf-issues icon indicating copy to clipboard operation
pdf-issues copied to clipboard

Cyclic vs acyclic linked list chains via dictionary keys is ambiguous

Open petervwyatt opened this issue 4 years ago • 23 comments

Many keys in PDF dictionaries are designed to create "linked lists" of objects of the same dictionary type. However, ISO 32000-2:2020 is commonly written such that the clarity of when cycles in such linked lists are and are not allowed is unclear. In many cases, this can only be inferred from what the dictionary data structure represents and whether a cycle makes any logical sense in that context. Cycles are also a common cause of crashes so clearly this is ambiguous and not commonly understood.

The word "acyclic" is used just once (Table 16, Extends key) to explicitly prohibit such cycles:

A given collection consists of a set of streams whose Extends links form a directed acyclic graph.

The phrase "infinite cycle" is used just once, in the NOTE under 12.6.4.4:

NOTE It is an error for a target dictionary to have an infinite cycle (for example, one where a target dictionary refers to itself). Interactive PDF processors need to attempt to detect such cases and refuse to execute the action if one is found.

It would be good if PDF 2.0 used consistent terminology/requirements for when cycles are and are not permitted. This may be as simple as adding simple statements such as "Cycles are permitted", "Cycles shall not be present"/"Data shall be acyclic" or similar...

This issue is to identify and gather those keys/dictionaries which need improved clarity regarding their cyclic / acyclic requirements.

petervwyatt avatar Jul 12 '21 23:07 petervwyatt

Table 172, Markup Annotations, IRT key (In Reply To): linked list of annotations. AFAICT makes no logical sense if cycles are permitted so should be acyclic.

petervwyatt avatar Jul 13 '21 00:07 petervwyatt

There are also scenarios where cycles are permitted and logical but implementers forget to address that particular scenario in that implementation (walking dictionaries with /Parent or /P keys being the most common).

lrosenthol avatar Jul 13 '21 13:07 lrosenthol

Yeap - but to me "back pointers" and "page pointers" are obvious places where a multitude of cycles will get introduced. They are mostly through a variety of keys and dictionaries (so the heterogeneous case) but, as you say, people still get this stuff incorrect so being more explicit won't hurt.

I was really focused on the less obvious homogeneous cases where lists of the same dict should or should not cause cycles. e.g. IRT in markup annots.

petervwyatt avatar Jul 13 '21 23:07 petervwyatt

I think this question primarily concerns tree-like structures in PDF. I found eight (!) kinds of them:

  • name tree
  • number tree
  • page tree
  • outlines tree
  • structure tree
  • action tree (as discussed in 12.6.2)
  • field tree (12.7.4.1)
  • selector rendition tree (13.2.3.3)

What we would like to ensure is that these structures do not contain cycles in the terminology of graphs.

The exact entries used in these structures to define parent-child relationship may differ. So, I would suggest to word this generically rather than adding some specific requirements for some entry values.

bdoubrov avatar Sep 30 '21 21:09 bdoubrov

I have been looking for any wording that would forbid infinite cyclic references such as:

8 0 obj
8 0 R
endobj

What should be the value of such an object?

pesco avatar Aug 01 '22 22:08 pesco

That is already covered in a few places:

  • dictionary keys cannot be indirect: "All keys shall be direct objects."
  • for traditional body objects in 7.3.10: "NOTE (2020) Any object outside of an object stream (see 7.5.7, "Object streams") can consist solely of an object reference. PDF syntax thus permits chains of such objects."
  • for objects in object streams: "An object in an object stream shall not consist solely of an object reference."

petervwyatt avatar Aug 01 '22 23:08 petervwyatt

@pesco so what you suggest is perfectly valid as an indirect object but only in objects in a traditional file body (no object streams). And the note warns developers to potentially expect cycles which in a number of features are explicitly required (e.g. outlines) or prohibited (e.g. object stream Extends)

petervwyatt avatar Aug 02 '22 01:08 petervwyatt

Right, so cyclic/infinite data structures are valid. Is the degenerate case of an object defined completely as itself also valid? If so, what is its type and value, in the same semantic sense that an indirect reference to, say, a number object is considered equivalent to that number object? Is it the null object? If so, is a reference to such an object valid wherever a null object is valid?

pesco avatar Aug 02 '22 16:08 pesco

Is the degenerate case of an object defined completely as itself also valid?

Do you mean just a simple direct object, such as 1 or (foo)?

or an indirect object version of each of those, such as:

1 0 obj
(foo)
endobj

in the same semantic sense that an indirect reference to, say, a number object is considered equivalent to that number object?

"equivalent" in what context? In the file format, they are not the same thing - there are direct objects and indirect objects and references. Only in application implementations - which are undefined - is there the consideration that references to indirect objects might be "equivalent", but that is out of scope.

lrosenthol avatar Aug 02 '22 16:08 lrosenthol

7.3.10 explicitly states "Except where documented to the contrary, any object value may be a direct or an indirect reference; the semantics are equivalent."

And by degenerate did you mean the following?:

10 0 obj
endobj

I cannot find explicit words covering this "empty object", but I think this would reduce to an implicit null object if it were the value of a dictionary key (for example). You cannot assume it is a particular type of PDF object so null seems the only logical conclusion.

petervwyatt avatar Aug 02 '22 20:08 petervwyatt

Yes, I meant equivalent in the sense of exactly the sentence you quoted, Peter, and I was talking about the example I gave of an (indirect) object being defined as a reference to itself. I called that a degenerate case of a cyclic data structure in that there is no data there.

My inner computer scientist says that the semantic value of this object should either be undefined, i.e. the object invalid, or null, i.e. a reference to nothing.

There is already this sentence in 7.3.10:

An indirect reference to an undefined object [...] shall be treated as a reference to the null object.

Which could be amended with something like "An indirect object cyclically defined (via indirect references) as itself shall be treated as the null object."

pesco avatar Aug 02 '22 20:08 pesco

@pesco - I haven't validated this as such, but I would consider your example below as invalid

8 0 obj
8 0 R
endobj

Not because it refers back to itself, but because there is no object there. I would consider a more normal version also problematic/invalid...

8 0 obj
9 0 R
endobj

9 0 obj
(string)
endobj

lrosenthol avatar Aug 02 '22 21:08 lrosenthol

I think the difference here is between something that is lexically/syntactically valid (all of @lrosenthol examples and my degenerate "empty object" example) and those which are semantically valid (e.g. the last 2 examples from @lrosenthol). But this runs very close to making certain assumptions about when a processor/parser makes such determinations - and we avoid this in PDF standards.

So although I agree with @lrosenthol conclusions, pointing to specific language in the PDF spec is a different matter - as is creating new wording that doesn't imply specific implementation or design choices.

petervwyatt avatar Aug 03 '22 05:08 petervwyatt

@lrosenthol If indeed

8 0 obj
8 0 R
endobj

is invalid, which was my first intuition, the spec does not make this clear. Your second example (an acyclic chain of references), however, is quite explicitly allowed by one of the statements Peter quoted earlier ("PDF syntax thus permits chains of [object references]").

@petervwyatt Your "empty object" is certainly not syntactically valid, unless your definition of syntax is extremely bare. Quoting 7.3.10 "Indirect objects":

The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white-space), followed by the value of the object bracketed between the keywords obj and endobj.

Emphasis mine. "The definition ... shall consist of ... (separated by white-space) ... followed by ..." is clearly a syntactic rule in the common sense of the word.

But you are right that the original question (concerning a purely cyclic object definition) is not a syntactic one. Hence my asking what the (semantic) value and/or type of an object like 8 0 in the example above should be.

pesco avatar Aug 03 '22 13:08 pesco

Your second example (an acyclic chain of references), however, is quite explicitly allowed by one of the statements Peter quoted earlier ("PDF syntax thus permits chains of [object references]").

It's not - because the first object (8) is invalid in the same way that other is. The fact that it is referring to itself is NOT the invalid part - it's the reference doing nothing...

PDF doesn't really have this type of referencing - what I might call "aliasing".

FWIW: I am trying to confirm how some "well known implementations" react to this scenario...

lrosenthol avatar Aug 03 '22 15:08 lrosenthol

OK - I put together a quick sample file (attached) and tried a few known implementations.

  • Acrobat (desktop, mobile and web and any other product based on our PDF Libraries) will error out on the file - more on this in a second.
  • Apple Preview (and anything based on Quartz) has no problem with the file.
  • Dropbox's PDF Preview (which I believe is based on PSPDFKit has no problem with the file
  • FoxIt (Mac) has no problem with the file.
  • Chrome (and assumably pdfium) has no problem with the file.
  • MuPDF (iOS) has no problem with the file.

HelloWorld_1x3.pdf

lrosenthol avatar Aug 03 '22 18:08 lrosenthol

So why does Acrobat fail on the file...

Our parser is designed to only expect to find indirect references in certain places (i.e. as entries in an array and as the values of dictionary entries). An indirect reference like this is therefore unexpected. I will note that this behavior was added to our parser 20 years ago (2002, if anyone cares) in order to address bad files coming out of a popular producer of the time.

If anyone cares about more specifics, using the example PDF, the parser reads the value as the integer object 7 (the objID in the 7 0 R). BUT then the parser sees the next "token" in the stream (the 0) and it is not something it expects, and so it (correctly) throws an error.

lrosenthol avatar Aug 03 '22 18:08 lrosenthol

Replying to @lrosenthol (my definition of works is that it displays "Hello World"):

  • on Windows, Acrobat DC 32-bit displays a blank page but no warning message. It also doesn't ask to save on exit. However if I preflight with "Syntax Check" then it does detect the error and complains (and not complete the syntax check preflight)
  • Nitro Pro PC works
  • Foxit Reader PC works
  • Tracker Software PDF-XChange works
  • Qoppa PDF Studio 2022 works
  • pdfforge PDF Architect works
  • pdftron Xodo displays a blank page but no error message
  • Xpdf Reader 4.04 works
  • UPDF works
  • PDFgear works
  • Slim PDF Reader works
  • Javelin PDF Reader works
  • pdfium (in Chrome) works
  • pdf.js (in FF) works
  • PDF Ink shows a blank page but no error message

petervwyatt avatar Aug 04 '22 10:08 petervwyatt

Replying to @pesco: yes, my meaning was meant in a very "bare" sense, as you put it, since the PDF file format spec cannot assume anything about lexer/parser design: breadth-first, depth-first, lazy, contextual vs context-free, etc.

I think we can definitely improve the wording around what the "value" part of an object means, what is valid/invalid, and how this may or may not "reduce" to an implicit null object.

petervwyatt avatar Aug 04 '22 10:08 petervwyatt

Correcting for wrong comment in #208: Follow up from PDF TWG discussions where we noted that some PDF software don't read /Length keys (they instead work it out for themselves): I made a new version of the test PDF HelloWorld_v2.pdf but with the chain of indirect references for the page content stream.

Results from my testing are:

  • Adobe Acrobat - blank page
  • Tracker Software PDF-XChange Editor - blank page
  • Nitro Pro - works, but says file was repaired
  • Foxit PDF Reader - works
  • Qoppa PDF Studio 2022 - works
  • PDF Architect 8 - blank page
  • Xpdf Reader 4.04 - blank page
  • pdfTron Xodo - blank page
  • callas pdfPilot - blank page
  • Javelin - blank page
  • Slim PDF - blank page
  • UPDF - blank page
  • PDFgear - blank page
  • qpdf --check - gets confused with odd warning messages
  • mutool clean - no problems detected
  • veraPDF features - no problems detected
  • iText RUPS - no problems detected
  • Apache PDFBox Debugger - blank page with warning message
  • pdf.js (FF) - blank page
  • pdfium (Chrome) - blank page
  • pdfium (Edge) - blank page

petervwyatt avatar Aug 18 '22 10:08 petervwyatt

I think the difference here is between something that is lexically/syntactically valid (all of @lrosenthol examples and my degenerate "empty object" example) and those which are semantically valid (e.g. the last 2 examples from @lrosenthol). But this runs very close to making certain assumptions about when a processor/parser makes such determinations - and we avoid this in PDF standards.

So although I agree with @lrosenthol conclusions, pointing to specific language in the PDF spec is a different matter - as is creating new wording that doesn't imply specific implementation or design choices.

Section 7.3.8.1 has this sentence: "All streams shall be indirect objects (see 7.3.10, "Indirect objects") and the stream dictionary shall be a direct object." Does this explicit mentioning of the dictionary object needing to be a direct object imply that indirect references are usually allowed in indirect objects? I.e. that

8 0 obj
9 0 R
endobj

is allowed?

gettalong avatar Oct 06 '24 21:10 gettalong

That wording is to avoid a malformed stream object where the stream extent dictionary is referenced via an indirect reference - so this is invalid:

8 0 obj
9 0 R
stream
...
endstream

There is a NOTE in 7.3.10 which states "NOTE (2020) Any object outside of an object stream (see 7.5.7, "Object streams") can consist solely of an object reference. PDF syntax thus permits chains of such objects." so in the general case and subject to other requirements, yes, chains of indirect references are allowed.

petervwyatt avatar Oct 06 '24 23:10 petervwyatt

so in the general case and subject to other requirements, yes, chains of indirect references are allowed.

Thanks - time to fix my library 😅

gettalong avatar Oct 07 '24 15:10 gettalong