pdf-issues
definition of "operator" in content streams
Section 7.8.2 (Content Streams) explains the term "operator" as follows:
An operator is a PDF keyword specifying some action that ...
but I did not find any definition of "PDF keyword" in the notes. It would be good to add a more explicit definition of what is a valid operator.
Maybe xxx is an operator, if and only if xxx is not empty and /xxx is a valid way to write a name? Assuming that "operator keyword" is the same as "operator", this would be consistent with
An operator keyword shall be distinguished from a name object by the absence of an initial SOLIDUS character (2Fh) (/).
(also from section 7.8.2.)
One aspect of this question is whether #63 is a valid way to write the c (curveto) operator.
Specifying what exactly is an "operator" would be useful for readers of content streams, where arbitrary operators can occur between BX and EX.
...Annex A ?
Great question.
Annex A is informative so doesn't help address this question.
I have also been investigating this exact same question. The other aspect is what to ignore between BX/EX compatibility operators as a "custom operator" vs a syntax error, which would give an error. Given the current set of rules for name objects, this would imply that operators can be symbols such as $, #, @, etc. as well as having name #-escape codes.
All 1st class names that are in the PDF spec are also always written in normalized (post de-escaping) form (i.e. without #-escapes) even tho' it is entirely valid for a 1st class name to be encoded with #-escapes. So this logic would also be expected to apply to operators...
Another data point: Neither Adobe Acrobat Reader nor MacOS Preview seems to accept #63 as an alternative to c. These might be implementation bugs, or a hint that operator names can't use #-escapes.
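To make the #63 question concrete, here is a minimal sketch of the #-escape decoding that applies to name objects. The function name `decode_name_escapes` is hypothetical. If operators followed the same rule as names, the token #63 would decode to c; whether that rule applies to operators at all is exactly what this issue asks.

```python
import re

# '#' followed by exactly two hexadecimal digits, as used in PDF name objects.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name_escapes(raw: bytes) -> bytes:
    """Replace each #xx escape with the single byte it encodes.

    This is the rule for name objects only; applying it to operator
    tokens is an assumption made purely for illustration.
    """
    return _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


# Under name-style decoding, "#63" and "c" would be the same token.
print(decode_name_escapes(b"#63") == b"c")
# A token without escapes is left unchanged.
print(decode_name_escapes(b"T*") == b"T*")
```

A reader that does not apply this decoding to operators would treat #63 as a separate, unknown token, which matches the viewer behaviour reported above.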
Looking at this question as well right now in relation to https://github.com/pdf-association/pdf-issues/issues/327 . For example, would such strings be operators or errors in the representation of numeric operands:
.
+1+
.0.
and so on.
PLRM stated "All characters besides the white-space characters and delimiters are referred to as regular characters. These include nonprinting characters that are outside the recommended PostScript ASCII character set." and "Any token that consists entirely of regular characters and cannot be interpreted as a number is treated as a name object (more precisely, an executable name). All characters except delimiters and white-space characters can appear in names, including characters ordinarily considered to be punctuation."
@petervwyatt PLRM definition still requires some modifications in PDF context to take into account arrays and dictionaries.
@bdoubrov - totally agree, but I was wondering if anything back in PS would help us with answering the question since operators are like PS executable names. But the answer is: not really. The PDF definition already accounts for arrays and dictionaries through the delimiter chars.
Trying to converge towards a common understanding (this is NOT proposed wording!!):
Clause 7.8.2 includes "The instructions shall be represented in the form of PDF objects, using the same object syntax as in the rest of the PDF file.", "A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical conventions".", and "An operator is a PDF keyword ..."
so the solution appears to lie in defining a "PDF keyword", which would be a new subclause under 7.2 Lexical conventions: 7.2.5 PDF keywords. This needs to account for all existing PDF keywords and operators as currently described in ISO 32000, which include the symbols * (as in the T* operator), ' and ", as well as digits (as in operators d0 and d1). All other keywords are upper or lower alpha (a-z, A-Z).
Thus a possible (broad) definition: a "PDF keyword" comprises printable ASCII characters from 21h (!) to 7Ah (z) inclusive that are not defined delimiters (Table 2, including { and }). A PDF keyword must also start with a character in this range that is NOT a digit 0-9 and NOT + or - (so that numbers are immediately identifiable when parsing).
Temporarily labelled "proposed solution" to get feedback from PDF TWG.
Another thing to decide is whether #-escapes are allowed or not, in particular since # is in the proposed character range. Given that the viewers I tried seem to not support escaped operator names, and that there really is no need to escape operator names, my suggestion would be to not allow #-escapes in "PDF keywords" (i.e. #63 would be different from c). Maybe # should additionally be forbidden in "PDF keywords", to remove any ambiguity?
I would agree with a very big "NO!!" for any form of obfuscation or escaping for PDF keywords - # escapes are limited to PDF names and \ sequences are limited to literal strings as currently documented because my proposed method of defining "PDF keywords" is as an official lexical convention whereas those are objects.
Also, PDF keywords should not start with ., so that .3 can be a number.
Agreed! Thanks for that pickup.
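As a rough sketch of the broad definition discussed so far (not proposed wording): characters from ! (byte 0x21) to z (byte 0x7A), no delimiters, no leading digit, +, - or ., and no #-escape decoding. The name `is_pdf_keyword_candidate` and the two constant sets are hypothetical, and the delimiter set is my reading of Table 2.

```python
# Delimiter characters as I read Table 2, including { and } as noted above.
DELIMITERS = frozenset(b"()<>[]{}/%")

# Characters a keyword must not start with, so that numbers such as
# 12, +1, -1 and .3 stay immediately identifiable.
FORBIDDEN_FIRST = frozenset(b"0123456789+-.")


def is_pdf_keyword_candidate(token: bytes) -> bool:
    """Check a token against the broad 'PDF keyword' rule from this thread."""
    if not token:
        return False
    for byte in token:
        # Only '!' (0x21) through 'z' (0x7A), and never a delimiter.
        if not (0x21 <= byte <= 0x7A) or byte in DELIMITERS:
            return False
    return token[0] not in FORBIDDEN_FIRST


for tok in (b"c", b"T*", b"d0", b"'", b"#63", b".0.", b"+1+", b"1a"):
    print(tok, is_pdf_keyword_candidate(tok))
```

Because no escape decoding is performed, #63 passes as a token in its own right, distinct from c. Forbidding # outright, as suggested above, would be a one-line change to the character check.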
Another thing to think about: are "true", "false" and "null" PDF keywords? I guess they can't be operator names (or are they?), but probably there is no harm in letting them be PDF keywords anyway?
Here is an alternative idea to consider: Maybe a PDF keyword is "any token which is not preceded by / and cannot be interpreted as a number"? Here, "token" is defined at the end of section 7.2.3 ("A sequence of consecutive regular characters comprises a single token."), and the "cannot be interpreted as a number" clause matches the approach PostScript takes.
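A small sketch of how this alternative would classify tokens, assuming the input has already been split into runs of regular characters and that "number" means the PDF integer/real syntax (optional sign, digits, at most one decimal point, no exponent or radix forms). The function `classify` and the `preceded_by_solidus` flag are hypothetical names.

```python
import re

# PDF-style numbers: optional sign, then digits with at most one decimal
# point and at least one digit (e.g. 12, +1, -.002, 4.). PostScript
# exponent and radix forms are deliberately not handled here.
_NUMBER = re.compile(rb"[+-]?(\d+\.?\d*|\.\d+)")


def classify(token: bytes, preceded_by_solidus: bool = False) -> str:
    """Classify a run of regular characters as 'name', 'number' or 'keyword'."""
    if preceded_by_solidus:
        return "name"
    if _NUMBER.fullmatch(token):
        return "number"
    return "keyword"


for tok in (b".", b"+1+", b".0.", b".3", b"4.", b"c", b"#63"):
    print(tok, classify(tok))
```

Under this definition the strings ., +1+ and .0. asked about earlier would all be keywords rather than malformed numbers, which is the "otherwise treat as a keyword" behaviour that a stricter character-based definition would avoid.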
@seehuhn yes - all current keywords are currently described as "keywords" and I certainly do not want to change or have to maintain any of that for this errata (endobj, endstream, f, false, n, null, obj, R, startxref, stream, trailer, true, xref). That said, having this list of all official PDF keywords stated in the new keyword section would also help.
I also definitely want to avoid the "or otherwise treat as a keyword" kind of definition. PDF keywords should only comprise 7-bit chars that are fully visible that are not delimiters or potentially ambiguous for other tokens / objects. The spec already accepts (without explicitly stating) that there may well be invalid bytes encountered when lexing, tokenising, etc. as PDF is 8-bit binary.
"true", "false" and "null" are indeed very much distinct from (all?) other PDF keywords in that they themselves are actually (the names of??) well-defined objects and can be used wherever such an object is valid.
Note that, for this purpose, "PDF operator (names)", being merely syntactical, are not treated as objects.
@seehuhn Thanks for pointing out where the PDF idea of "a token" comes from!
@petervwyatt I don't see an actual proposal here...
PDF TWG discussed - suggest 2 new lexical conventions: PDF keyword (for the usual PDF keywords) and separately "content stream operator" as a better direction (this will change more wording in content stream section). New wording to be proposed...
Summarising the way forward (again, NOT precise wording):
- new subclause: "PDF keywords" - case sensitive. No escape sequences. Comprise upper and lowercase ASCII characters only (no digits). Current set of defined PDF keywords: obj, endobj, stream, endstream, startxref, xref, trailer, R, true, false, f, n, null.
- new subclause: "Content stream operator" - case sensitive. No escape sequences. Comprise 1 or more consecutive printable ASCII characters from 21h (!) to 7Ah (z) inclusive AND that is not a defined delimiter char (Table 2) AND is not a defined PDF keyword.
The rule for "Content stream operator" needs an additional clause to make sure that numbers cannot be mistaken for content stream operators. Should <<, >>, [, ] be PDF keywords? Otherwise looks good to me.
Content stream operators probably should be limited to appear in content streams.
Also:
AND is not a defined PDF keyword.
That is incorrect, there is both a keyword f and a content stream operator f. Similarly for n.
Yes - limiting to content streams also ensures no possible confusion so no need to mention "keywords" w.r.t "content stream operators". Thanks.
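To illustrate the two points just raised, here is a sketch of the summarised rules, with the "AND is not a defined PDF keyword" clause left out as agreed. The names `PDF_KEYWORDS`, `is_pdf_keyword` and `matches_operator_rule` are hypothetical, the lower bound is taken as ! (byte 0x21), and the delimiter set is my reading of Table 2.

```python
# The set of defined PDF keywords listed in the summary above.
PDF_KEYWORDS = frozenset({
    b"obj", b"endobj", b"stream", b"endstream", b"startxref", b"xref",
    b"trailer", b"R", b"true", b"false", b"f", b"n", b"null",
})

# Delimiter characters as I read Table 2.
DELIMITERS = frozenset(b"()<>[]{}/%")


def is_pdf_keyword(token: bytes) -> bool:
    """Case-sensitive membership test, with no escape decoding."""
    return token in PDF_KEYWORDS


def matches_operator_rule(token: bytes) -> bool:
    """One or more characters from '!' (0x21) to 'z' (0x7A), no delimiters."""
    return len(token) >= 1 and all(
        0x21 <= byte <= 0x7A and byte not in DELIMITERS for byte in token
    )


# f and n are both PDF keywords and content stream operators.
print(is_pdf_keyword(b"f"), matches_operator_rule(b"f"))
# Numbers also satisfy the character rule as summarised.
print([matches_operator_rule(t) for t in (b"12", b"-1", b".5")])
```

The second print shows why the rule still needs a clause excluding numbers: 12, -1 and .5 consist only of characters in the permitted range.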
And in the current spec <<, [, etc are delimiters (a.k.a. token separators between keywords).
Thanks for the listing of the set of defined “PDF keywords”.
I was surprised at how short it is! :-)
A few consequences (but I may have misinterpreted what you wrote):
So no other tokens should, from now on, be described using the term “PDF keyword”.
But what about the more basic term “keyword”? Do we need a list of these too?
Or are “PDF keywords” and “keywords” identical entities?
And are all types of keyword “PDF objects”?
Also, from now, “content stream operators” are not “PDF Keywords”, so that the statements (more than one) in 7.8.2 need to change, for example it states:
An operator is a PDF keyword
Also, please confirm that these operators are “PDF Objects”!
Are all types of keyword also “PDF Objects”? This may need clarification somewhere.
@car222222 - trying to answer your questions:
So no other tokens should, from now on, be described using the term “PDF keyword”.
Yes.
But what about the more basic term “keyword”? Do we need a list of these too?
I reviewed all occurrences of "keyword" before proposing the above. Besides the current confusion with operators (and a separate italicised use as a fragment identifier view parameter in Table Annex O.4), all other uses mean the same as "PDF keyword." I used "PDF keyword" for the exact reason you ask this question --> there are no other forms of keywords in PDF 😀
Or are “PDF keywords” and “keywords” identical entities?
Yes
And are all types of keyword “PDF objects”?
No. Keywords are not PDF objects according to clause 7.3. Objects can require/include certain keywords tho' - see 7.3.2 Boolean, 7.3.8 Stream objects, 7.3.9 Null object, and 7.3.10 Indirect objects. Think of keywords and content stream operators more at a lexical or token level, rather than as objects, which are a higher-order concept (but I wouldn't want to state any of that in the spec).
Also, from now, “content stream operators” are not “PDF Keywords”, so that the statements (more than one) in 7.8.2 need to change, for example it states: An operator is a PDF keyword
Yes. If this proposal is accepted we will need to change this wording from "keywords" to "content stream operator". They are different things.
Also, please confirm that these operators are “PDF Objects”!
No - as above.
Are all types of keyword also “PDF Objects”? This may need clarification somewhere.
No - as above.
Hope that helps.
Many thanks for the clarifications!
I can now search for all the places that need changed wordings!:-)
It might be useful to explicitly state somewhere that "keyword" and "PDF keyword" are synonyms.
Yes. Or always use the full phrase "PDF keyword". All terms will also need to be added to the official T&Ds eventually. Up to the PDF TWG to decide.
Clause 7.3 will anyway need some extension: e.g., to add these "content stream operators" as an extra type of object.
Also, maybe to clarify that "null, true, false" are both "PDF Objects" and "PDF keywords" (or that, as "PDF keywords", they are only the "names of objects")?
No - the additions will be new subclauses under 7.2 Lexical conventions.
true and false are keywords that represent a Boolean object with a specific value, in the same way as the null keyword represents the singular(!) null object. Yes, this may feel like semantics but it is an important distinction. The only changes necessary in 7.3 would be "keyword" --> "PDF keyword" if we wanted to bother.