Comments in stream data

Open LegionMammal978 opened this issue 1 year ago • 34 comments

In ISO 32000-2:2020, 7.2.4 states,

Any occurrence of the PERCENT SIGN (25h) outside a string or inside a content stream (see 7.8.2, "Content streams") introduces a comment.

Read literally, every percent sign that is not inside a string satisfies the first alternative ("outside a string"), so this implies that a percent sign inside any stream, content stream or not, always introduces a comment. 7.2.3 states, "The rules defined in this subclause apply to all characters in the file except within strings, streams, and comments," but the statement in 7.2.4 does not appeal to the classification of characters defined in 7.2.3 and so is not bound by its limitations. This is a breaking change from PDF 1.7 (ISO 32000-1:2008), where 7.2.3 instead says "outside a string or stream".

If a percent sign inside a non-content stream does not always introduce a comment, then there is still the question of whether a percent sign within the decoded data of an object stream can introduce a comment. 7.5.7 states that "the N objects are stored consecutively" in an object stream following the list of byte offsets. Does 7.2.4 apply to parsing these objects after decoding the stream data? 7.2.1 suggests that objects as syntactic entities are formed from tokenized bytes, using the ordinary syntax rules which accept comments:

At the most fundamental level, a PDF file is a sequence of bytes. These bytes can be grouped into tokens according to the syntax rules described in subclauses 7.2.2, "Representation" through 7.2.4, "Comments". One or more tokens are assembled to form higher-level syntactic entities, principally objects, which are the basic data values from which a PDF file is constructed.

However, if comments are permitted in object streams, then further clarification is needed in 7.5.7. In particular, what if one object in an object stream is followed by a comment with no EOL marker, and the next byte offset points into that comment? For instance,

1 0 obj
<< /Type /ObjStm
   /Length 17
   /N 2
   /First 8
>>
stream
2 0 3 6
123 % 456
endstream
endobj
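
As a sanity check on the numbers in this example, the following Python sketch recomputes the byte layout. It assumes the decoded stream data is exactly the two lines shown, separated by a single LF, with no trailing EOL counted in /Length. The variable names are illustrative only.

```python
# Decoded stream data, assuming a single LF between the two lines
# and no trailing EOL (an assumption about the example, not a spec rule).
data = b"2 0 3 6\n123 % 456"
first, n = 8, 2  # /First and /N from the stream dictionary

# The offset table occupies the first /First bytes:
# N pairs of (object number, byte offset relative to /First).
nums = [int(tok) for tok in data[:first].split()]
table = [(nums[2 * i], nums[2 * i + 1]) for i in range(n)]

print(len(data))  # total length, to compare against /Length
for obj_num, offset in table:
    start = first + offset
    # Show where each object is claimed to start and what follows.
    print(obj_num, start, data[start:])
```

Under these assumptions the data is 17 bytes long, the percent sign sits at byte 12, and the offset for object 3 resolves to byte 14, the start of `456`, which lies after the percent sign on the same line.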

7.5.7 Note 7 suggests that "processing of each object in an object stream starts at the specified byte offset in the decompressed stream and ends prior to the byte offset of the next object or when the end of stream is encountered", which would permit this: object 3 would be parsed as the integer 456, starting at its byte offset. But a parser that reads the list of objects in one pass would treat "% 456" as a comment and skip object 3.
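
To make the divergence concrete, here is a minimal Python sketch of the two strategies applied to the example data. It is a toy, not a real PDF parser: it assumes every object is a bare integer, that no strings are present (so stripping from `%` to the next EOL is safe), that comments are honoured inside object streams, and that the offsets are listed in increasing order. The function names `parse_by_offset` and `parse_sequentially` are hypothetical.

```python
import re

DATA = b"2 0 3 6\n123 % 456"  # decoded stream data, single LF assumed
FIRST, N = 8, 2               # /First and /N


def strip_comments(buf):
    """Remove everything from '%' up to (not including) the next EOL.

    Only safe here because the toy data contains no strings.
    """
    return re.sub(rb"%[^\r\n]*", b"", buf)


def read_offsets(data, first, n):
    """Return the (object number, offset) pairs from the offset table."""
    nums = [int(tok) for tok in data[:first].split()]
    return [(nums[2 * i], nums[2 * i + 1]) for i in range(n)]


def parse_by_offset(data, first, n):
    """Note 7 reading: each object runs from its own offset up to the
    next object's offset, or to the end of the stream for the last one."""
    table = read_offsets(data, first, n)
    ends = [first + off for _, off in table[1:]] + [len(data)]
    objects = {}
    for (num, off), end in zip(table, ends):
        chunk = strip_comments(data[first + off:end])
        objects[num] = int(chunk.split()[0])  # toy: integer objects only
    return objects


def parse_sequentially(data, first, n):
    """One-pass reading: tokenize everything after /First and hand out
    tokens in table order, ignoring the offsets entirely."""
    table = read_offsets(data, first, n)
    tokens = strip_comments(data[first:]).split()
    return {
        num: int(tokens[i]) if i < len(tokens) else None
        for i, (num, _) in enumerate(table)
    }


print(parse_by_offset(DATA, FIRST, N))
print(parse_sequentially(DATA, FIRST, N))
```

With these assumptions, the offset-based reader yields 123 for object 2 and 456 for object 3, while the one-pass reader yields 123 for object 2 and finds nothing for object 3, because `% 456` has already been discarded as a comment.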

(There's also another question, of whether an object stream can begin with white-space, since the wording only explicitly permits white-space separating the integers specifying the byte offsets. But this may be adequately implied already by the ordinary syntax rules.)

LegionMammal978 • Apr 11 '23 16:04