pdf-issues
File identifiers (14.4): Observations and possible improvements
SUB-ISSUE 1: File identifiers definition
Subclause 14.4 (File identifiers) states:
The value of this entry shall be an array of two byte strings. The first byte string shall be a permanent identifier based on the contents of the PDF file at the time it was originally created [...]
To me, the requirement that the identifier be based on the contents of the PDF file is arbitrary and misleading. Since PDF 2.0, even the suggested MD5-based hashing algorithm has dropped its main content-related input ("The values of all entries in the file's document information dictionary [...]", see PDF 1.7), which speaks volumes: it now retains merely "The size of the PDF file in bytes" as a barely content-related input!
As later stated, the sole relevant attribute of a computed identifier is uniqueness (which implies the statistical robustness of the generative algorithm against collisions):
PDF writers should attempt to ensure the uniqueness of file identifiers.
Even the NOTE to subclause 14.4 in PDF 1.7 stressed that "all that matters is that the identifier is likely to be unique"... What about standard generative algorithms like, say, UUID as alternatives to the suggested MD5-based hashing algorithm? I do not advocate dismissing the suggested hashing algorithm, just rewording subclause 14.4 to make clearer the distinction between specification (identifier uniqueness) and implementation (algorithms suitable to meet the specification).
Therefore, IMO, the sentence in subclause 14.4 should be reformulated, like so:
The value of this entry shall be an array of two unique byte strings, each at least 16 bytes long. The first byte string shall be a permanent identifier, not to change when the PDF file is updated. The second byte string shall be a changing identifier, computed when the PDF file is updated (see 7.5.6, "Incremental updates").
The last paragraph ("PDF writers should attempt to ensure the uniqueness [...]") would in turn be replaced by:
Identifier uniqueness should be attempted by employing a suitable generative algorithm, such as UUID (described in Internet RFC 4122); this may also be achieved by means of a message digest algorithm such as MD5 (described in Internet RFC 1321), using the following information:
- the current time;
- a string representation of the PDF file's location;
- the size of the PDF file in bytes.
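To make the two options in the proposed wording concrete, here is a minimal Python sketch of both: a random UUID (RFC 4122, version 4) and an MD5 digest over the three inputs listed above. The function names, the way the inputs are serialized before hashing, and the uppercase hexadecimal formatting are my own assumptions for illustration; neither the spec nor this proposal prescribes them.

```python
import hashlib
import time
import uuid


def uuid_file_id() -> bytes:
    """Return a random (version 4) UUID as a 16-byte string."""
    return uuid.uuid4().bytes


def md5_file_id(location: str, size_in_bytes: int) -> bytes:
    """Digest the current time, the file location and the file size with MD5.

    The serialization of each input (decimal text, UTF-8) is an assumption;
    any stable encoding would serve the same purpose.
    """
    digest = hashlib.md5()
    digest.update(str(time.time_ns()).encode("ascii"))  # the current time
    digest.update(location.encode("utf-8"))             # the file's location
    digest.update(str(size_in_bytes).encode("ascii"))   # the file size in bytes
    return digest.digest()


def format_id_entry(permanent: bytes, changing: bytes) -> str:
    """Format the two byte strings as an /ID entry using hexadecimal strings."""
    return f"/ID [<{permanent.hex().upper()}><{changing.hex().upper()}>]"
```

Either function yields a 16-byte string, so both satisfy the "at least 16 bytes long" wording proposed above; for example, format_id_entry(uuid_file_id(), md5_file_id("/tmp/example.pdf", 259675)) produces an entry of the same shape as the usual trailer /ID array.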
SUB-ISSUE 2: File identifiers mapping to XMP metadata
Apparently, there is a lack of documentation at core PDF level regarding the relation between PDF file identifiers and document identifiers in XMP metadata: while Table 349 in subclause 14.3.3 (Document information dictionary) suggests a precise mapping between information entries and document-level XMP metadata, there is no indication for the reconciliation between file identifier's (permanent and changing) byte strings and semantically-corresponding document identifiers (DocumentID and InstanceID) in Media Management namespace. Although higher-level PDF specs in ISO stack (such as PDF/A) are designed to add constraints atop the core PDF spec, the latter could nonetheless provide a general mapping suggestion the same way it already does for the document information dictionary entries (after all, even the mapping of those entries is mentioned both at core PDF and PDF/A levels!).
In XMP metadata, document identifiers are typed as GUID, which the XMP spec (part 1 (ISO 16684-1:2011), Annex A) describes eloquently (emphasis is mine):
This document defines three GUIDs that are intended to help manage copies of a resource [(xmpMM:DocumentID, xmpMM:InstanceID, xmpMM:OriginalDocumentID)], to identify a specific state when desired, and to associate related copies of the same conceptual resource. [...] The use of robust GUIDs is encouraged; having globally unique values is important. In practical terms, this means that the probability of a collision is so remote as to be effectively impossible. Typically, 128-bit or 144-bit numbers are used, encoded as hexadecimal strings. This document does not require any particular methodology for creating a GUID, nor does it require any specific means of formatting the GUID as a simple XMP value. The only valid operations on XMP IDs are to create them, to assign one to another, and to compare two of them for equality. Comparisons use the Unicode string value as-is, using a direct byte-for-byte check for equality. IETF RFC 4122 ( http://www.ietf.org/rfc/rfc4122.txt ) describes ways to create and format GUID strings. For privacy, the use of a MAC address is not recommended. See section 4.1.6 of RFC 4122 for details and alternatives.
According to the description above, PDF file identifier byte strings seem compatible with document identifiers in the XMP Media Management namespace; furthermore, AFAIK, PDF/A allows freedom of identification scheme (identifiers may be externally based (eg, ISBN) or internally based (eg, UUID)). Could it be acceptable in general cases (ie, without specific constraints) to assign the permanent file identifier byte string of a given PDF file to the corresponding xmpMM:DocumentID property, and its changing file identifier byte string to the corresponding xmpMM:InstanceID property?
This way, it would be possible to have, by default, a single pair of identifiers, without unnecessary redundancies, like so (the byte strings below are obviously fictitious):
trailer
<<
/Size 5275
/Root 118 0 R
/Info 2751 0 R
/ID [<33333333333333333333333333333333><44444444444444444444444444444444>]
/Prev 259675
>>
. . .
2152 0 obj<< /Subtype /XML /Length 4751 /Type /Metadata >>
stream
<?xpacket begin="" id="XXXXXXXXXXXXXXXXXXXXXXXX"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description xmlns:xmpMM='http://ns.adobe.com/xap/1.0/mm/'
xmpMM:DocumentID='33333333333333333333333333333333'
xmpMM:InstanceID='44444444444444444444444444444444'/>
. . .
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
endstream
endobj
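
The mapping shown above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper names are hypothetical, and the choice of uppercase hexadecimal as the XMP string form is arbitrary (the XMP spec does not mandate any formatting). Since XMP IDs are compared as strings byte for byte, the only real requirement is that a writer picks one textual form and applies it consistently.

```python
from typing import Tuple


def id_to_xmp(identifier: bytes) -> str:
    """Encode a file identifier byte string as an uppercase hex XMP value."""
    return identifier.hex().upper()


def xmp_to_id(value: str) -> bytes:
    """Decode a hex XMP value back into a file identifier byte string."""
    return bytes.fromhex(value)


def identifiers_match(trailer_id: Tuple[bytes, bytes],
                      document_id: str,
                      instance_id: str) -> bool:
    """Check that the /ID pair agrees with xmpMM:DocumentID and xmpMM:InstanceID.

    The comparison is an exact string comparison, mirroring the XMP rule
    that IDs are only ever compared for equality as-is.
    """
    permanent, changing = trailer_id
    return (id_to_xmp(permanent) == document_id
            and id_to_xmp(changing) == instance_id)
```

With the fictitious values of the example, identifiers_match((bytes.fromhex("33" * 16), bytes.fromhex("44" * 16)), "33" * 16, "44" * 16) holds, whereas a lowercase rendering of an identifier containing the hex digits A-F would not match, which is exactly why a single agreed-upon formatting would be worth suggesting in the spec.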