definition of "operator" in content streams

Open seehuhn opened this issue 2 years ago • 43 comments

Section 7.8.2 (Content Streams) explains the term "operator" as follows:

An operator is a PDF keyword specifying some action that ...

but I did not find any definition of "PDF keyword" in the notes. It would be good to add a more explicit definition of what is a valid operator.

Maybe xxx is an operator, if and only if xxx is not empty and /xxx is a valid way to write a name? Assuming that "operator keyword" is the same as "operator", this would be consistent with

An operator keyword shall be distinguished from a name object by the absence of an initial SOLIDUS character (2Fh) (/).

(also from section 7.8.2.)

One aspect of this question is whether #63 is a valid way to write the c (curveto) operator.

Specifying what exactly is an "operator" would be useful for readers of content streams, where arbitrary operators can occur between BX and EX.
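
To make the #63 question concrete, here is a minimal Python sketch of #-escape decoding as it is defined for name objects. The function name `decode_name_escapes` is hypothetical. Whether the same decoding applies to operators is exactly the open question; the sketch only shows what the name rule would yield if it did.

```python
import re

# A '#' followed by exactly two hexadecimal digits, as used in name objects.
_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name_escapes(raw: bytes) -> bytes:
    """Replace every #xx escape with the byte it encodes (name-object rule)."""
    return _ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)


# Under the name rule, "#63" and "c" are the same byte sequence.
same_as_c = decode_name_escapes(b"#63") == b"c"
```

If operators followed the name rule, `#63` and `c` would be the same operator; a reader that does not decode escapes in operators sees two different tokens.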

seehuhn avatar Nov 08 '23 12:11 seehuhn

...Annex A ?

datalogics-pgallot avatar Nov 08 '23 15:11 datalogics-pgallot

Great question.

Annex A is informative so doesn't help address this question.

I have also been investigating this exact same question. The other aspect is what to ignore between the BX/EX compatibility operators as a "custom operator" versus what to treat as a syntax error. Given the current set of rules for name objects, this would imply that operators can be symbols such as $, #, @, etc., as well as contain name #-escape codes.

All 1st class names that are in the PDF spec are also always written in normalized (post de-escaping) form (i.e. without #-escapes) even tho' it is entirely valid for a 1st class name to be encoded with #-escapes. So this logic would also be expected to apply to operators...

petervwyatt avatar Nov 08 '23 23:11 petervwyatt

Another data point: Neither Adobe Acrobat Reader nor macOS Preview seems to accept #63 as an alternative to c. This might be due to implementation bugs, or it might be a hint that operator names can't use #-escapes.

seehuhn avatar Nov 09 '23 07:11 seehuhn

Looking at this question as well right now in relation to https://github.com/pdf-association/pdf-issues/issues/327 . For example, would such strings be operators or errors in the representation of numeric operands:

.
+1+
.0.

and so on.
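
One way to test such strings is against a strict reading of the numeric object syntax: an optional sign, then decimal digits with at most one leading, embedded, or trailing period. The Python sketch below assumes that reading. `is_pdf_number` is a hypothetical helper, and real-world parsers are often more lenient.

```python
import re

# Optional sign, then either digits with an optional period and more digits,
# or a leading period followed by at least one digit.
_NUMBER = re.compile(r"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)")


def is_pdf_number(token: str) -> bool:
    """Return True if the whole token matches the strict numeric syntax."""
    return _NUMBER.fullmatch(token) is not None


# The three strings from the question, checked against the strict syntax.
results = {token: is_pdf_number(token) for token in [".", "+1+", ".0."]}
```

Under this reading none of the three strings is a number, so each would have to be either an operator or a syntax error.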

bdoubrov avatar Nov 09 '23 09:11 bdoubrov

PLRM stated "All characters besides the white-space characters and delimiters are referred to as regular characters. These include nonprinting characters that are outside the recommended PostScript ASCII character set." and "Any token that consists entirely of regular characters and cannot be interpreted as a number is treated as a name object (more precisely, an executable name). All characters except delimiters and white-space characters can appear in names, including characters ordinarily considered to be punctuation."

petervwyatt avatar Nov 09 '23 10:11 petervwyatt

@petervwyatt PLRM definition still requires some modifications in PDF context to take into account arrays and dictionaries.

bdoubrov avatar Nov 10 '23 09:11 bdoubrov

@bdoubrov - totally agree, but I was wondering if anything back in PS would help us answer the question, since operators are like PS executable names. But the answer is: not really. The PDF definition already accounts for arrays and dictionaries through the delimiter chars.

petervwyatt avatar Nov 10 '23 22:11 petervwyatt

Trying to converge towards a common understanding (this is NOT proposed wording!!):

Clause 7.8.2 includes "The instructions shall be represented in the form of PDF objects, using the same object syntax as in the rest of the PDF file.", "A content stream, after decoding with any specified filters, shall be interpreted according to the PDF syntax rules described in 7.2, "Lexical conventions".", and "An operator is a PDF keyword ..."

so the solution appears to lie in defining a "PDF keyword", which would be a new subclause under 7.2 Lexical conventions: 7.2.5 PDF keywords. This needs to account for all existing PDF keywords and operators as currently described in ISO 32000, which include the symbols * (as in the T* operator), ' and ", as well as digits (as in operators d0 and d1). All other keywords are upper or lower alpha (a-z, A-Z).

Thus a possible (broad) definition for "PDF keyword" is a sequence of printable ASCII characters from 21h (!) to 7Ah (z) inclusive, none of which is a defined delimiter (Table 2, including { and }). PDF keywords must also start with a character in this range that is NOT a digit 0-9, and NOT + or - (so that numbers are immediately identifiable when parsing).
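
The draft rule above can be sketched in Python as follows. `is_keyword_candidate` is a hypothetical name, the delimiter set is assumed to be the ten delimiter characters of Table 2, and the range is taken to start at the character ! (code 0x21) and end at z (code 0x7A).

```python
# Assumed delimiter characters (Table 2): ( ) < > [ ] { } / %
DELIMITERS = set("()<>[]{}/%")

# Characters the draft rule forbids in first position.
FORBIDDEN_FIRST = set("0123456789+-")


def is_keyword_candidate(token: str) -> bool:
    """Apply the draft rule: '!'..'z', no delimiters, no leading digit or sign."""
    if not token:
        return False
    if token[0] in FORBIDDEN_FIRST:
        return False
    return all("!" <= ch <= "z" and ch not in DELIMITERS for ch in token)


# Existing operators that use symbols or digits still qualify.
existing_ok = all(is_keyword_candidate(op) for op in ["T*", "d0", "'", '"', "BDC"])
```

With this draft, T*, d0, ' and " all qualify. Note that #63 also qualifies as a token in its own right, distinct from c.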

petervwyatt avatar Jun 05 '24 01:06 petervwyatt

Temporarily labelled "proposed solution" to get feedback from PDF TWG.

petervwyatt avatar Jun 05 '24 01:06 petervwyatt

Another thing to decide is whether #-escapes are allowed or not, in particular since # is in the proposed character range. Given that the viewers I tried seem to not support escaped operator names, and that there really is no need to escape operator names, my suggestion would be to not allow #-escapes in "PDF keywords" (i.e. #63 would be different from c). Maybe # should additionally be forbidden in "PDF keywords", to remove any ambiguity?

seehuhn avatar Jun 05 '24 06:06 seehuhn

I would agree with a very big "NO!!" for any form of obfuscation or escaping for PDF keywords. As currently documented, # escapes are limited to PDF names and \ sequences are limited to literal strings. Those are objects, whereas my proposed method defines "PDF keywords" as an official lexical convention.

petervwyatt avatar Jun 05 '24 06:06 petervwyatt

Also, PDF keywords should not start with ., so that .3 can be a number.

seehuhn avatar Jun 05 '24 06:06 seehuhn

Agreed! Thanks for that pickup.

petervwyatt avatar Jun 05 '24 06:06 petervwyatt

Another thing to think about: are "true", "false" and "null" PDF keywords? I guess they can't be operator names (or are they?), but probably there is no harm in letting them be PDF keywords anyway?

seehuhn avatar Jun 05 '24 08:06 seehuhn

Here is an alternative idea to consider: Maybe a PDF keyword is "any token which is not preceded by / and cannot be interpreted as a number"? Here, "token" is defined at the end of section 7.2.3 ("A sequence of consecutive regular characters comprises a single token."), and the "cannot be interpreted as a number" clause matches the approach PostScript takes.
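
A rough Python sketch of this alternative, under simplifying assumptions: strings, hex strings, and comments are not handled, a number is recognised by a strict sign-digits-period pattern, and `classify_tokens` is a hypothetical name. It only illustrates the classification order (name, then number, then keyword).

```python
import re

WHITESPACE = b"\x00\t\n\x0c\r "   # the six PDF white-space characters
DELIMITERS = b"()<>[]{}/%"        # assumed delimiter set (Table 2)
NUMBER = re.compile(rb"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)")


def classify_tokens(data: bytes) -> list[tuple[str, bytes]]:
    """Classify each run of regular characters as name, number, or keyword."""
    result = []
    i = 0
    while i < len(data):
        if data[i] in WHITESPACE or data[i] in DELIMITERS:
            i += 1
            continue
        start = i
        while (i < len(data)
               and data[i] not in WHITESPACE
               and data[i] not in DELIMITERS):
            i += 1
        token = data[start:i]
        if start > 0 and data[start - 1:start] == b"/":
            kind = "name"       # token directly preceded by a SOLIDUS
        elif NUMBER.fullmatch(token):
            kind = "number"     # can be interpreted as a number
        else:
            kind = "keyword"    # everything else
        result.append((kind, token))
    return result


tokens = classify_tokens(b"/F1 12 Tf .3 T* +1+")
```

With this ordering, `+1+` falls through to "keyword" simply because it is not a number, which is the PostScript-style behaviour the definition would import.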

seehuhn avatar Jun 05 '24 08:06 seehuhn

@seehuhn yes - all current keywords are currently described as "keywords" and I certainly do not want to change or have to maintain any of that for this errata (endobj, endstream, f, false, n, null, obj, R, startxref, stream, trailer, true, xref). That said, having this list of all official PDF keywords stated in the new keyword section would also help.

I also definitely want to avoid the "or otherwise treat as a keyword" kind of definition. PDF keywords should only comprise 7-bit chars that are fully visible that are not delimiters or potentially ambiguous for other tokens / objects. The spec already accepts (without explicitly stating) that there may well be invalid bytes encountered when lexing, tokenising, etc. as PDF is 8-bit binary.

petervwyatt avatar Jun 06 '24 00:06 petervwyatt

"true", "false" and "null" are indeed very much distinct from (all?) other PDF keywords in that they themselves are actually (the names of??) well-defined objects and can be used wherever such an object is valid.

Note that, for this purpose, "PDF operator (names)", being merely syntactical, are not treated as objects.

car222222 avatar Jun 06 '24 04:06 car222222

@seehuhn Thanks for pointing out where the PDF idea of "a token" comes from!

car222222 avatar Jun 06 '24 04:06 car222222

@petervwyatt I don't see an actual proposal here...

lrosenthol avatar Jun 06 '24 18:06 lrosenthol

PDF TWG discussed - suggest 2 new lexical conventions: PDF keyword (for the usual PDF keywords) and separately "content stream operator" as a better direction (this will change more wording in content stream section). New wording to be proposed...

petervwyatt avatar Jun 06 '24 20:06 petervwyatt

Summarising the way forward (again, NOT precise wording):

  • new subclause: "PDF keywords" - case sensitive. No escape sequences. Comprise upper and lowercase ASCII characters only (no digits). Current set of defined PDF keywords: obj, endobj, stream, endstream, startxref, xref, trailer, R, true, false, f, n, null.

  • new subclause: "Content stream operator" - case sensitive. No escape sequences. Comprise 1 or more consecutive printable ASCII characters from 21h (!) to 7Ah (z) inclusive, none of which is a defined delimiter char (Table 2), AND is not a defined PDF keyword.

petervwyatt avatar Jun 09 '24 08:06 petervwyatt

The rule for "Content stream operator" needs an additional clause to make sure that numbers cannot be mistaken for content stream operators. Should <<, >>, [, ] be PDF keywords? Otherwise looks good to me.

seehuhn avatar Jun 09 '24 09:06 seehuhn

Content stream operators probably should be limited to appear in content streams.

Also:

AND is not a defined PDF keyword.

That is incorrect, there is both a keyword f and a content stream operator f. Similarly for n.
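
The overlap can be checked mechanically. The Python sketch below uses the keyword list from the summary above and, as an assumption for brevity, only the path-painting operators rather than the full operator set; `keyword_operator_clashes` is a hypothetical name.

```python
# Keyword list as given in the summary.
PDF_KEYWORDS = {
    "obj", "endobj", "stream", "endstream", "startxref", "xref",
    "trailer", "R", "true", "false", "f", "n", "null",
}

# Path-painting operators only (a subset of all content stream operators).
PATH_PAINTING_OPERATORS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"}


def keyword_operator_clashes() -> list[str]:
    """Return the tokens that are both a PDF keyword and an operator."""
    return sorted(PDF_KEYWORDS & PATH_PAINTING_OPERATORS)


clashes = keyword_operator_clashes()
```

Any rule that excludes defined PDF keywords from the operator set would therefore reject the existing f and n operators.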

mkl-public avatar Jun 09 '24 10:06 mkl-public

Yes - limiting to content streams also ensures no possible confusion so no need to mention "keywords" w.r.t "content stream operators". Thanks.

And in the current spec <<, [, etc are delimiters (a.k.a. token separators between keywords).

petervwyatt avatar Jun 09 '24 12:06 petervwyatt

Thanks for the listing of the set of defined “PDF keywords”.

I was surprised at how short it is! :-)

A few consequences (but I may have misinterpreted what you wrote):

So no other tokens should, from now on, be described using the term “PDF keyword”.

But what about the more basic term “keyword”? Do we need a list of these too?

Or are “PDF keywords” and “keywords” identical entities?

And are all types of keyword “PDF objects”?

Also, from now, “content stream operators” are not “PDF Keywords”, so that the statements (more than one) in 7.8.2 need to change, for example it states:

An operator is a PDF keyword

Also, please confirm that these operators are “PDF Objects”!

Are all types of keyword also “PDF Objects”? This may need clarification somewhere.

car222222 avatar Jun 13 '24 04:06 car222222

@car222222 - trying to answer your questions:

So no other tokens should, from now on, be described using the term “PDF keyword”.

Yes.

But what about the more basic term “keyword”? Do we need a list of these too?

I reviewed all occurrences of "keyword" before proposing the above. Besides the current confusion with operators (and a separate italicised use as a fragment identifier view parameter in Table Annex O.4), all other uses mean the same as "PDF keyword." I used "PDF keyword" for the exact reason you ask this question --> there are no other forms of keywords in PDF 😀

Or are “PDF keywords” and “keywords” identical entities?

Yes

And are all types of keyword “PDF objects”?

No. Keywords are not PDF objects according to clause 7.3. Objects can require/include certain keywords tho' - see 7.3.2 Boolean, 7.3.8 Stream objects, 7.3.9 Null object, and 7.3.10 Indirect objects. Think of keywords and content stream operators more at a lexical or token level, rather than as objects, which are a higher-order concept (but I wouldn't want to state any of that in the spec).

Also, from now, “content stream operators” are not “PDF Keywords”, so that the statements (more than one) in 7.8.2 need to change, for example it states: An operator is a PDF keyword

Yes. If this proposal is accepted we will need to change this wording from "keywords" to "content stream operator". They are different things.

Also, please confirm that these operators are “PDF Objects”!

No - as above.

Are all types of keyword also “PDF Objects”? This may need clarification somewhere.

No - as above.

Hope that helps.

petervwyatt avatar Jun 13 '24 07:06 petervwyatt

Many thanks for the clarifications!

I can now search for all the places that need wording changes! :-)

It might be useful to explicitly state somewhere that "keyword" and "PDF keyword" are synonyms.

car222222 avatar Jun 13 '24 07:06 car222222

Yes. Or always use the full phrase "PDF keyword". All terms will also need to be added to the official T&Ds eventually. Up to the PDF TWG to decide.

petervwyatt avatar Jun 13 '24 07:06 petervwyatt

Clause 7.3 will anyway need some extension: e.g., to add these "content stream operators" as an extra type of object.

Also, maybe to clarify that "null, true, false" are both "PDF Objects" and "PDF keywords" (or that, as "PDF keywords", they are only the "names of objects")?

car222222 avatar Jun 13 '24 07:06 car222222

No - The additions will be new subclauses to 7.2 Lexical conventions.

true and false are keywords that represent a Boolean object with a specific value, in the same way as the null keyword represents the singular(!) null object. Yes, this may feel like semantics but it is an important distinction. The only changes necessary in 7.3 would be "keyword" --> "PDF keyword" if we wanted to bother.

petervwyatt avatar Jun 13 '24 07:06 petervwyatt