
Relation between `xmpMM:DocumentID` and document ID

Open seehuhn opened this issue 1 year ago • 6 comments

The "minimal PDF file" in appendix H.2 uses the xmpMM:DocumentID and xmpMM:InstanceID properties in its XMP metadata stream, and explains that these properties are a "unique GUID of document" and a "GUID changed for each save", respectively. The purpose of these fields seems very similar to the two elements of the ID array in the file trailer dictionary, as introduced in Section 14.4 (File identifiers).

It would be nice if the PDF spec explained the relation between these two pairs of identifiers: Are writers meant to generate two sets of independent identifiers for each document, or can/should/shall the XMP identifiers be somehow derived from the PDF file identifiers?

Also, are the XMP identifiers required or optional? (If optional, maybe don't show them in the "minimal file" example?)
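
To make the question concrete, here is a minimal Python sketch of the first reading, where a writer generates the two pairs independently. It assumes the Section 14.4 behaviour for the trailer ID (two byte strings, identical when the file is first written, with only the second changing on later saves). Random bytes stand in for whatever digest a real writer would compute, and the `uuid:` prefix on the XMP values is an assumption. All function names are hypothetical, and nothing here is mandated by the spec.

```python
import os
import uuid


def new_trailer_id() -> tuple[bytes, bytes]:
    """Trailer /ID for a newly written file: both strings start out equal."""
    permanent = os.urandom(16)  # assumption: random bytes instead of a digest
    return permanent, permanent


def updated_trailer_id(current: tuple[bytes, bytes]) -> tuple[bytes, bytes]:
    """On a later save, keep the first string and replace the second."""
    permanent, _ = current
    return permanent, os.urandom(16)


def new_xmp_ids() -> tuple[str, str]:
    """Independent XMP identifiers (DocumentID, InstanceID); prefix is assumed."""
    return f"uuid:{uuid.uuid4()}", f"uuid:{uuid.uuid4()}"


def format_trailer_id(ids: tuple[bytes, bytes]) -> str:
    """Render the pair as a trailer dictionary entry with hex strings."""
    return "/ID [<%s> <%s>]" % (ids[0].hex().upper(), ids[1].hex().upper())


if __name__ == "__main__":
    trailer = new_trailer_id()
    print(format_trailer_id(trailer))
    print(format_trailer_id(updated_trailer_id(trailer)))
    print(new_xmp_ids())
```

In this sketch the XMP values share nothing with the trailer ID. Deriving one from the other would be the alternative reading asked about above.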

seehuhn • Apr 25 '24 14:04

A few notes:

  • Annex H is very old and was not maintained for PDF 2.0 - there are probably some outdated and deprecated features being used

  • XMP is only ever metadata, nothing more. And metadata is always optional for "general-purpose PDF" - but it is required for ISO subsets such as PDF/A, PDF/UA and PDF/X as per their specific standards. The trailer ID entry is "real" PDF data and used with encryption (see Table 15).

petervwyatt • Apr 25 '24 15:04

Understood. (But note that the XMP metadata stream was not shown in the examples in the PDF 1.7 spec. It seems to have been added for the 2.0 spec.)

seehuhn • Apr 25 '24 18:04

I think the best solution may be to simply remove the following lines from all XMP examples:

<rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>… unique GUID of document …</xmpMM:DocumentID>
<xmpMM:InstanceID>… GUID changed for each save …</xmpMM:InstanceID>
</rdf:Description>

The PDF spec seems like an odd place to explain the xmpMM:DocumentID and xmpMM:InstanceID properties, and there seems to be little benefit in showing these entries in the examples at all.
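
For readers who want to see what these two properties look like to a consumer, here is a small Python sketch that reads them from an XMP packet using the standard-library `xml.etree.ElementTree`. The sample packet, its placeholder `uuid:` values and the function name `read_xmp_ids` are hypothetical. The sketch handles only the element form shown above, not properties written as attributes on `rdf:Description`.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the XPath lookups below.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xmpMM": "http://ns.adobe.com/xap/1.0/mm/",
}

# Hypothetical packet; the identifier values are placeholders, not real GUIDs.
SAMPLE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/">
<xmpMM:DocumentID>uuid:00000000-0000-0000-0000-000000000001</xmpMM:DocumentID>
<xmpMM:InstanceID>uuid:00000000-0000-0000-0000-000000000002</xmpMM:InstanceID>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>"""


def read_xmp_ids(xmp_xml: str):
    """Return (DocumentID, InstanceID); either is None when absent."""
    root = ET.fromstring(xmp_xml)
    doc = root.find(".//xmpMM:DocumentID", NS)
    inst = root.find(".//xmpMM:InstanceID", NS)
    return (
        doc.text if doc is not None else None,
        inst.text if inst is not None else None,
    )


if __name__ == "__main__":
    print(read_xmp_ids(SAMPLE))
```

Since XMP is optional for general-purpose PDF, a reader along these lines has to tolerate both properties being absent, which is why the function returns `None` rather than raising.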

seehuhn • Apr 26 '24 08:04

I agree. ISO 32K-2 doesn't need to spell out anything to do with the internals of XMP for "general PDF" - that's the job of the XMP spec or of the PDF ISO subsets, where lots of specific things are required.

petervwyatt • Apr 27 '24 00:04

@petervwyatt What is the proposed change here?

lrosenthol • May 07 '24 05:05

In Annex H, remove all the gory XMP micro-details (since that is the job of the XMP spec) and just leave block comments describing what the XMP needs to represent - and do NOT explain things like which xmpMM properties are to be preserved or updated. Search for "xmpMM:" to see the 2 examples in Annex H.

petervwyatt • May 07 '24 09:05

PDF TWG agree

petervwyatt • May 23 '24 20:05

PDF/A TWG doesn't see any immediate need for any notes on how to align XMP-based IDs with trailer IDs. The use of this data is very different in various implementations.

bdoubrov • Jun 10 '24 15:06