Encoding of URL in developer extensions dictionary (ISO 32000-2:2020)
Table 49 does not specify the encoding of the URL entry. In other places, e.g., Table 238 ("ASCII string") and 7.11.5 ("RFC 3986"), the encoding is specified, but none of these apply to developer extensions dictionaries.
It is currently defined as string, which means that any kind of string from subclause 7.9.2 is permitted.
But since this represents a URL, any Unicode should be %-encoded, so reducing it to an ASCII string seems appropriate.
Do we allow IRIs? If so, this is too restrictive. If not, we should probably clarify that (e.g. with a note saying that IRIs can be represented using percent encoding or punycode).
I think URLs are sufficient for this purpose, so adding a note to use percent-encoding would be fine. This would then align with the uses of URLs in 7.11.5, 12.6.4.8, and 14.10.3.2.
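To illustrate what such a note would ask of a producer, here is a minimal Python sketch of mapping an IRI to a 7-bit ASCII URL: Punycode (via Python's built-in "idna" codec) for the host name, and percent-encoding of the UTF-8 bytes everywhere else. The function name `iri_to_ascii_url` is hypothetical, nothing here comes from ISO 32000-2, and the sketch assumes an IRI without userinfo.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-encoding; "%" is kept so that
# escapes already present in the input are not encoded a second time.
_SAFE = "%:@!$&'()*+,;=~-._/"


def iri_to_ascii_url(iri: str) -> str:
    """Hypothetical helper: reduce an IRI to a 7-bit ASCII URL.

    Assumption: the IRI carries no userinfo ("user:pass@"); this sketch
    rebuilds the authority from host and port only.
    """
    parts = urlsplit(iri)
    host = parts.hostname or ""
    # Punycode applies to the host name only (Python's built-in IDNA codec).
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Everything else: percent-encode the UTF-8 bytes of non-ASCII characters.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

The result contains only ASCII characters, so it can be written to the file as an ASCII string regardless of how the IRI was originally expressed.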
ISO 32000-2 does not mention Punycode or reference the RFC now, so I'd avoid extending.
Leonard wrote a doc for ISO back in 2009/2010 about IRIs, URLs, PunyCode, etc. We will locate this doc and review in the PDF Association to see if a worthwhile TechNote can be made.
Here is the information from the document in question:
We’ve evaluated this issue in the past, as we are aware that it’s a problem with PDF existing in the “modern web”. There are two reasons that we haven’t solved it – 1) language/file format issues and 2) Acrobat/Reader changes.
As far as the file format (PDF language) is concerned, the problem is really around compatibility. Today, a URI can be included in a PDF in the following places:
• Base URI entry in the Catalog – URI Dictionary
• Navigator UUID – text string
• Link Annotation /PA (from Web Capture) – URI Dictionary
• URI action – URI Dictionary
• URI entry in the URI Dictionary – ASCII string
• RichText link – XHTML string/stream
• OutputIntent RegistryName – text string
In addition, there are a few places in the PDF spec that refer to URLs instead of URIs and specifically reference RFC 1738 and 7-bit ASCII (except where marked ** below):
• URL specifications in a FileSpec dictionary – ASCII string (or PDFDocEncoding is allowed)
• URL entry in Extension dictionary – text string**
• URL entry in TimeStamp seed value – ASCII string
• URL entry in Certificate seed value – ASCII string
• Caption entry in PaperMetaData dictionary – text string**
• Submit action – URL-based file specification
• BU entry for MediaClips MH/BE dicts – ASCII string
• U key in Software Identifier (for Media) – ASCII string
• URL Strings in WebCapture content sets – ASCII string
• AU entry in source information dictionary – ASCII string
• U and C entries in URL Alias dictionary – ASCII string
• URL entry in Web Capture command dictionary – ASCII string
• URLs entry in OutputIntent – URL-based file specification
As you can see from this list, almost every place that uses a URL or URI defines it as a 7-bit ASCII string, although there is a limited set of places that happen to allow “text strings”, which are defined in either PDFDocEncoding (ISO Latin 1) or UTF-16BE.
Although it would be possible to simply change the definition of some/all of the ASCII strings to “text strings” in UTF-16BE and maintain file format compatibility (since a string is a string syntactically) – the fact is that you’d break compatibility with existing readers (from Adobe and elsewhere) that are only expecting those values to be ASCII. This would mean that, for some/all of these keys where you wanted to support IRIs, you’d need to create NEW keys where the IRI data would go (plus you’d probably also put in a recommendation to have the producer put the URI information in as well).
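The compatibility concern can be made concrete with a small Python sketch that classifies the raw bytes of a PDF string object the way a cautious reader might before treating it as a URL. The function name `classify_pdf_string` and its return labels are hypothetical. The UTF-16BE check relies on the FE FF byte order marker that introduces such text strings; the UTF-8 branch is an addition reflecting PDF 2.0, which is not covered by the document quoted above.

```python
def classify_pdf_string(raw: bytes) -> str:
    """Hypothetical helper: guess how the bytes of a PDF string are encoded."""
    if raw.startswith(b"\xfe\xff"):
        # Text string in UTF-16BE, introduced by its byte order marker.
        return "utf-16be"
    if raw.startswith(b"\xef\xbb\xbf"):
        # Assumption beyond the quoted document: UTF-8 text string (PDF 2.0).
        return "utf-8"
    if all(b < 0x80 for b in raw):
        # 7-bit clean: what readers expecting an "ASCII string" can handle.
        return "ascii"
    # High bytes without a marker: PDFDocEncoding text or an arbitrary byte string.
    return "pdfdocencoding-or-bytes"
```

A reader written against the "ASCII string" definition effectively handles only the `"ascii"` case, which is why silently widening existing keys to text strings would break it.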
ISO 32000-2 did not adopt PaperMetaData or the Navigator UUID.
We also added the NS entry in the Logical structure Namespace dictionary (14.7.4.2 and Table 356) as a text string. And there are a few more new PDF 2.0 features that utilize File Spec dictionaries and thus "inherit" URI/URLs via that mechanism: Associated Files and PronunciationLexicon to name just two.
JS (ECMAScript) can also include URI/URLs. And, of course, there is the entire PDF Fragment Identifier feature in Annex O (but that is not a file format thing).
See also #256