pdf-issues Indirect objects need to be single PDF basic objects only

I am trying to locate the requirement ("shall"/"shall not") in PDF which requires an "indirect object" (as referenced by an "indirect reference") to always be a single basic PDF object and not anything else. But I can't...

Consider an array: [ /a /b /c 10 0 R ]. I hope everyone would agree that object 10 0 can only be a single PDF basic object type such as:

10 0 obj
/d
endobj

resulting in the array having 4 elements.

And that an indirect reference is not some form of expansion or continuation - so this would be invalid syntax and the array would not have 6 elements:

10 0 obj
/d /e /f 
endobj

The informal-ish definition of a PDF type of object comes from clause 7.3.1 which states:

PDF syntax includes nine basic types of objects: boolean values, integers, real numbers, strings, names, arrays, dictionaries, streams, and the null object.

Can someone identify the requirement?

May 22 '22 04:05 petervwyatt

Well,

10 0 obj
/d /e /f 
endobj

is not valid PDF syntax for a PDF object.

If you instead had,

10 0 obj
[/d /e /f ]
endobj

That would be a single array object, of course. And then your example ([ /a /b /c 10 0 R ]) would become [ /a /b /c [/d /e /f] ]
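
As a minimal sketch of that substitution (the `Ref` class, `resolve` function, and `objects` table are hypothetical names, not anything from the spec), resolving a reference swaps in exactly one value, so the outer array keeps 4 elements:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference 'num gen R'."""
    num: int
    gen: int


def resolve(value, objects):
    # A reference is replaced by the one object it points to.
    # Assumption: no reference cycles; a cycle would recurse forever here.
    if isinstance(value, Ref):
        return resolve(objects[(value.num, value.gen)], objects)
    if isinstance(value, list):
        return [resolve(item, objects) for item in value]
    return value


# Object 10 0 holds a single array object: [/d /e /f]
objects = {(10, 0): ["/d", "/e", "/f"]}
array = ["/a", "/b", "/c", Ref(10, 0)]
resolved = resolve(array, objects)
```

Here `resolved` is a 4-element list whose last element is the nested array; nothing is spliced into the outer array.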

May 23 '22 15:05 lrosenthol

@lrosenthol I totally agree with your statements - but I am having trouble identifying exactly which "shall" statements in 32K are being violated...

May 23 '22 23:05 petervwyatt

Seems pretty clear to me with 7.3.10 (32K-2)

Any object in a PDF file may be labelled as an indirect object. This gives the object a unique object identifier by which other objects can refer to it (for example, as an element of an array or as the value of a dictionary entry).

This sets it up that an indirect object is a standard object (as described in 7.3.2-7.3.9) which has a unique object identifier associated (by which other objects can refer to it).

This is followed by a clear normative statement:

The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white-space), followed by the value of the object bracketed between the keywords obj and endobj.

So we see here that an indirect object shall consist of the object number + generation number + the value of the object (bracketed by obj/endobj).

So your example:

10 0 obj
/d /e /f 
endobj

cannot be valid, because the syntax of /d /e /f is not the syntax for any valid object as defined in 7.3.2-7.3.9.
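
A rough sketch of that reading, assuming a toy subset of the syntax (names, integers, arrays, and references only) and hypothetical function names: the parser reads exactly one object after obj and then requires endobj.

```python
import re

# Assumption: a toy token set (array brackets, names, integers, keywords).
# Real PDF also has strings, reals, dictionaries, streams, and comments.
TOKEN = re.compile(r"\[|\]|/[^\s\[\]/]*|[+-]?\d+|[A-Za-z]+")
INT = re.compile(r"[+-]?\d+")


def tokenize(text):
    # findall silently skips characters outside the toy token set.
    return TOKEN.findall(text)


def parse_value(tokens, pos):
    """Parse exactly one object at tokens[pos]; return (value, next_pos)."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    tok = tokens[pos]
    if tok == "[":
        items, pos = [], pos + 1
        while pos < len(tokens) and tokens[pos] != "]":
            item, pos = parse_value(tokens, pos)
            items.append(item)
        if pos >= len(tokens):
            raise ValueError("unterminated array")
        return items, pos + 1
    if tok.startswith("/"):
        return tok, pos + 1
    if INT.fullmatch(tok):
        # "int int R" is one indirect reference; it is kept unresolved.
        if (tok.isdigit() and pos + 2 < len(tokens)
                and tokens[pos + 1].isdigit() and tokens[pos + 2] == "R"):
            return ("ref", int(tok), int(tokens[pos + 1])), pos + 3
        return int(tok), pos + 1
    raise ValueError(f"unexpected token {tok!r}")


def parse_indirect_object(text):
    tokens = tokenize(text)
    header_ok = (len(tokens) >= 3 and tokens[0].isdigit()
                 and tokens[1].isdigit() and tokens[2] == "obj")
    if not header_ok:
        raise ValueError("expected 'N G obj' header")
    value, pos = parse_value(tokens, 3)
    # Exactly one object is allowed before endobj.
    if pos >= len(tokens) or tokens[pos] != "endobj":
        raise ValueError("expected 'endobj' after a single object")
    return int(tokens[0]), int(tokens[1]), value
```

With this sketch, `parse_indirect_object("10 0 obj /d endobj")` and the bracketed array form both succeed, while `"10 0 obj /d /e /f endobj"` raises `ValueError` because `/e` appears where endobj is required.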

May 24 '22 01:05 lrosenthol

3.44, T&D for 'object' says "basic data structure from which PDF files are constructed and includes these 9 types: array, boolean, dictionary, integer, name, null, real, stream and string". The same is stated in the 1st para of 7.3.1.

Neither of those explicitly say "one of", but I don't think it's necessary, because "/d /e /f" does not match any one of those object types.

Can I ask what triggered this question?

Jun 30 '22 09:06 MPBailey

The question arose from SafeDocs when developing formal grammars of the PDF COS syntax for indirect references when used in array elements: whether the indirect reference always resolves to a single PDF object or could potentially be multiple PDF objects concatenated together (so more like a #include or macro expansion than an object reference).

I think the answer is obvious and we'd all agree, but cannot see where we actually concretely state it using normative language.

Jul 06 '22 07:07 petervwyatt

A related question is whether an indirect reference can be decomposed into 2 indirect references (each to an integer) followed by the keyword R: so 10 0 R 0 R which might then result in 99 0 R if object 10 was just the integer 99:

10 0 obj
99
endobj

Again I think the answer is obvious (no!) and we'd all agree, but cannot see where we actually concretely state it using normative language.

Jul 06 '22 07:07 petervwyatt

This is what happens when people experienced in writing parsers for programming languages and compilers read our spec and say "where does it state you cannot do that?"🤣 - Is a COS parser LALR(1), LL(1), LALR(2), something else? Is there no backtracking? Etc.

Jul 06 '22 07:07 petervwyatt

A related question is whether an indirect reference can be decomposed into 2 indirect references (each to an integer) followed by the keyword R: so 10 0 R 0 R which might then result in 99 0 R if object 10 was just the integer 99:

I think we can resolve that one by simply making a statement that object dereferencing is not part of parsing.
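
To illustrate what I mean, here is a minimal sketch (hypothetical function name, whitespace-separated tokens only, ASCII digits assumed) of a parser that recognises references purely from the token sequence and never looks at what object 10 contains:

```python
def parse_elements(text):
    """Group whitespace-separated tokens into names, integers, and references.

    No object table is consulted: "int int R" becomes an unresolved
    reference based on the token sequence alone.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.isdigit():
            if (i + 2 < len(tokens) and tokens[i + 1].isdigit()
                    and tokens[i + 2] == "R"):
                out.append(("ref", int(tok), int(tokens[i + 1])))
                i += 3
            else:
                out.append(int(tok))
                i += 1
        elif tok.startswith("/"):
            out.append(tok)
            i += 1
        else:
            # A keyword such as a stray R has no meaning on its own here.
            raise ValueError(f"unexpected keyword {tok!r}")
    return out
```

Under this sketch, `10 0 R 0 R` parses as a reference to object 10 0, then the integer 0, then a stray R, which is rejected. Whether object 10 holds the integer 99 never enters into it.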

That said, I don't see how we can address this entire class of issues without putting together a formal grammar that at least nails down what constitutes valid PDF object syntax (i.e. a full definition of everything between an x y obj header and endobj). Note the emphasis on syntax: things like "making sure stream data is decodable" wouldn't be in scope. I also don't think we have to agree on specific parser types as long as the production rules of the language are unambiguous.

I don't think such a grammar is all that difficult to write down. Actually, didn't we already have one somewhere?

Jul 06 '22 07:07 MatthiasValvekens

If what you are referring to is the previous "grammar work" done in WG8 then that was only ever related to the PDF DOM objects (cf. the Arlington PDF Model - being Adobe DVA and a Levigo XText grammar), and not these kinds of lower-level lexical, tokenization, and syntactic rules. Sorry.

Jul 06 '22 10:07 petervwyatt

The definition of an indirect object in a PDF file shall consist of its object number and generation number (separated by white-space), followed by the value of the object bracketed between the keywords obj and endobj.
So we see here that an indirect object shall consist of the object number + generation number + the value of the object (bracketed by obj/endobj).

I agree that this text (7.3.10 Indirect objects) answers the original question. The data between obj and endobj must be the value of an object. The syntax for object values is defined in 7.3 (including the syntax for references in 7.3.10). The latter is made explicit in 7.3.1 General:

    Each object type, their method of creation and their proper referencing as indirect objects is described
    in 7.3.2, "Boolean objects" through 7.3.10, "Indirect objects".

The above is a long way of saying "the following subsections define the syntax for objects".

As to the question of whether the numbers in an indirect reference are allowed to be references themselves (e.g. "1 2 R 3 R"), let's look at what 7.3.10 Indirect objects says:

    references shall consist of the object number, the generation number, and the keyword R (with
    white-space separating each part):

It may not be absolutely explicit, but it's pretty clear from the context that what is described here is the syntax of a reference and that the numbers mentioned are not themselves considered to be PDF objects that happen to be numbers in the sense of 7.3.3 Numeric objects. There are probably more instances of the word "number" used in this way.

Jul 11 '22 17:07 pesco

As noted in some other issues - you @pesco are trying to intermix parsing & tokenization with object referencing. PDF doesn't work that way - those things are completely separate operations in the job of a PDF processor. This is well stated right at the top of 7.2.1:

At the most fundamental level, a PDF file is a sequence of bytes. These bytes can be grouped into tokens according to the syntax rules described in subclauses 7.2.2, “Representation” through 7.2.4, “Comments”. One or more tokens are assembled to form higher-level syntactic entities, principally objects, which are the basic data values from which a PDF file is constructed.
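
A rough sketch of that first stage (bytes to tokens), using a hypothetical `tokens` function. It assumes only whitespace, comments, array brackets, names, and regular tokens; strings, hex strings, and dictionaries are deliberately not handled:

```python
# White-space and delimiter characters as listed in 7.2.
WHITESPACE = b"\x00\t\n\x0c\r "
DELIMITERS = b"()<>[]{}/%"


def tokens(data: bytes):
    """Group bytes into tokens; no objects are built and nothing is resolved."""
    out = []
    i, n = 0, len(data)
    while i < n:
        c = data[i:i + 1]
        if c in WHITESPACE:
            i += 1
        elif c == b"%":
            # A comment runs to the end of the line and yields no token.
            while i < n and data[i:i + 1] not in b"\r\n":
                i += 1
        elif c in b"[]":
            out.append(c)
            i += 1
        elif c in b"()<>{}":
            # Assumption: these constructs are out of scope for this sketch.
            raise NotImplementedError("strings and dictionaries not handled")
        else:
            start = i
            i += 1  # consume the first byte (possibly "/" starting a name)
            while i < n and data[i:i + 1] not in WHITESPACE + DELIMITERS:
                i += 1
            out.append(data[start:i])
    return out
```

At this level `10`, `0`, and `R` are just three tokens; assembling them into an indirect reference, and later looking up object 10, are separate steps.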

Aug 11 '22 18:08 lrosenthol

I was trying to provide my reasoning in support of your own conclusion, @lrosenthol. I don't know what you're trying to get at, to be honest. I can assure you, however, that I am aware of the distinction between parsing and lexical analysis.

Aug 11 '22 21:08 pesco
