pdf2archive
Preserve XMP-dc and XMP-xmpRights metadata (and not only)
Ghostscript (GS) discards all existing XMP metadata and writes only the entries it generates itself. However, the original file may carry additional XMP metadata, which is likely if it was generated with, e.g., LaTeX and the pdfx package (take a look at the pdfx docs). The idea is to preserve this additional metadata as far as possible.
I'd say that the most important ones, in this case, are the Dublin Core metadata. The idea is to extract them with exiftool from the original document and then feed them back into the converted document (see e.g. this example). This has to be done with some care, because the PDF metadata in the Info dictionary (title, author, etc.) has to match the corresponding entries in the XMP metadata. The ideal solution would be to add only the XMP keys that are missing after GS's conversion, while keeping the XMP keys that GS wrote (which GS guarantees to match the entries in the PDF Info dictionary).
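A rough sketch of how the copy step might look, assuming exiftool is installed. The function name copy_missing_xmp and the EXIFTOOL override variable are made up for illustration and are not part of pdf2archive. The -wm cg write mode asks exiftool to create new tags only and leave existing ones untouched, which is the "add only the missing keys" behavior described above. Whether the result still validates as PDF/A-1B has not been checked here.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdf2archive): copy the Dublin Core and
# rights-management XMP tags from the original PDF into the converted one.
#   $1 = original PDF, $2 = PDF produced by Ghostscript
copy_missing_xmp() {
    src=$1
    dst=$2
    # -wm cg         : create new tags/groups only, never edit existing tags,
    #                  so the XMP entries written by GS stay as they are.
    # -tagsFromFile  : read the tags to copy from the original document.
    # EXIFTOOL is an assumed override hook, e.g. EXIFTOOL="echo exiftool"
    # prints the command instead of running it.
    ${EXIFTOOL:-exiftool} -overwrite_original -wm cg \
        -tagsFromFile "$src" -XMP-dc:all -XMP-xmpRights:all "$dst"
}
```

It would be called as copy_missing_xmp document.pdf document-PDFA.pdf after the Ghostscript conversion has finished.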
This behavior should be optional, since it depends on an external tool (exiftool) in addition to Ghostscript, and the result is not guaranteed to remain PDF/A-1B compliant. It could be triggered by a --keepxmpmetadata option or something similar. Likewise, it would be nice to make it easy to add XMP metadata that is not present in the original; a possible route would be to support some sort of simple metadata document format, similar to this one. An option like --addxmpmetadata=document.xmpdata would then do the job.
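For illustration, the two proposed options could be recognized roughly like this. This is only a sketch: the variable names KEEP_XMP and ADD_XMP_FILE and the function parse_xmp_options are invented here, and the way pdf2archive actually parses its arguments may differ.

```shell
#!/bin/sh
# Hypothetical option handling for the two proposed flags.
KEEP_XMP=0        # set to 1 by --keepxmpmetadata
ADD_XMP_FILE=""   # set to FILE by --addxmpmetadata=FILE

parse_xmp_options() {
    for arg in "$@"; do
        case $arg in
            --keepxmpmetadata)
                KEEP_XMP=1 ;;
            --addxmpmetadata=*)
                # Strip the option name, keep everything after the "=".
                ADD_XMP_FILE=${arg#--addxmpmetadata=} ;;
        esac
    done
}
```

Both options leave the default behavior unchanged when they are not given, so exiftool would only be required when the user asks for it.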
Here is an example that compares the original document.pdf from this example (which is a valid PDF/A-1B produced directly by LaTeX) with the document-PDFA.pdf resulting from the conversion with pdf2archive:
$ exiftool -a -G1 document.pdf
[ExifTool] ExifTool Version Number : 10.80
[System] File Name : document.pdf
[System] Directory : .
[System] File Size : 64 kB
[System] File Modification Date/Time : 2018:03:11 18:03:12+01:00
[System] File Access Date/Time : 2018:03:11 18:26:15+01:00
[System] File Inode Change Date/Time : 2018:03:11 18:07:25+01:00
[System] File Permissions : rw-r--r--
[File] File Type : PDF
[File] File Type Extension : pdf
[File] MIME Type : application/pdf
[PDF] PDF Version : 1.4
[PDF] Linearized : No
[PDF] Page Count : 1
[PDF] Page Mode : UseOutlines
[PDF] Title : Production of PDF/A-Compliant Documents
[PDF] Subject : This is a sample document.
[PDF] Creator : LaTeX with hyperref package
[PDF] Create Date : 2018:03:11 18:03:07+01:00
[PDF] Modify Date : 2018:03:11 18:03:07+01:00
[PDF] Producer : pdfTeX
[PDF] Trapped : False
[PDF] GTS PDFA1 Version : PDF/A-1b:2005
[PDF] PTEX Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017) kpathsea version 6.2.3
[XMP-x] XMP Toolkit : Adobe XMP Core 4.0-c316 44.253921, Sun Oct 01 2006 17:14:39
[XMP-pdfaExtension] Schemas Namespace URI : http://ns.adobe.com/pdfx/1.3/
[XMP-pdfaExtension] Schemas Prefix : pdfx
[XMP-pdfaExtension] Schemas Schema : PDF/X Schema
[XMP-pdfaExtension] Schemas Property Category : external
[XMP-pdfaExtension] Schemas Property Description: URL to an online version or preprint
[XMP-pdfaExtension] Schemas Property Name : AuthoritativeDomain
[XMP-pdfaExtension] Schemas Property Value Type : Text
[XMP-pdfaExtension] Schemas Value Type : .
[XMP-pdfaExtension] Schemas Schema : PRISM metadata
[XMP-pdfaExtension] Schemas Namespace URI : http://prismstandard.org/namespaces/basic/2.2/
[XMP-pdfaExtension] Schemas Prefix : prism
[XMP-pdfaExtension] Schemas Property Name : aggregationType
[XMP-pdfaExtension] Schemas Property Value Type : Text
[XMP-pdfaExtension] Schemas Property Category : external
[XMP-pdfaExtension] Schemas Property Description: The type of publication. If defined, must be one of book, catalog, feed, journal, magazine, manual, newsletter, pamphlet.
[XMP-pdfaExtension] Schemas Value Type : .
[XMP-pdf] Producer : pdfTeX
[XMP-dc] Format : application/pdf
[XMP-dc] Title : Production of PDF/A-Compliant Documents
[XMP-dc] Creator : Mr. Document Guy
[XMP-dc] Publisher :
[XMP-dc] Rights : Copyright © 2017 "Mr. Document Guy"
[XMP-dc] Description : This is a sample document.
[XMP-dc] Subject : PDF, Archiving, LaTeX
[XMP-pdfaid] Part : 1
[XMP-pdfaid] Conformance : B
[XMP-xmp] Creator Tool : LaTeX with hyperref package
[XMP-xmp] Modify Date : 2018:03:11 18:03:07+01:00
[XMP-xmp] Create Date : 2018:03:11 18:03:07+01:00
[XMP-xmp] Metadata Date : 2018:03:11 18:03:07+01:00
[XMP-xmpRights] Marked : True
[XMP-xmpRights] Usage Terms : Copyright © 2017 "Mr. Document Guy"
[XMP-xmpMM] Document ID : uuid:467B87E0-A1EA-A037-7CB7-0477245DEBC3
[XMP-xmpMM] Instance ID : uuid:5BE82E03-BDCF-F30F-AC8F-19E215F00935
$ exiftool -a -G1 document-PDFA.pdf
[ExifTool] ExifTool Version Number : 10.80
[System] File Name : document-PDFA.pdf
[System] Directory : .
[System] File Size : 15 kB
[System] File Modification Date/Time : 2018:03:11 18:26:15+01:00
[System] File Access Date/Time : 2018:03:12 14:45:39+01:00
[System] File Inode Change Date/Time : 2018:03:11 18:26:15+01:00
[System] File Permissions : rw-r--r--
[File] File Type : PDF
[File] File Type Extension : pdf
[File] MIME Type : application/pdf
[PDF] PDF Version : 1.4
[PDF] Linearized : No
[PDF] Page Count : 1
[PDF] Producer : GPL Ghostscript 9.22
[PDF] Create Date : 2018:03:11 18:26:15+01:00
[PDF] Modify Date : 2018:03:11 18:26:15+01:00
[PDF] Creator : LaTeX with hyperref package
[PDF] Title : Production of PDF/A-Compliant Documents
[PDF] Subject : This is a sample document.
[PDF] Author :
[PDF] Trapped : False
[XMP-x] XMP Toolkit : XMP toolkit 2.9.1-13, framework 1.6
[XMP-pdf] Producer : GPL Ghostscript 9.22
[XMP-pdf] Keywords :
[XMP-xmp] Modify Date : 2018:03:11 18:26:15+01:00
[XMP-xmp] Create Date : 2018:03:11 18:26:15+01:00
[XMP-xmp] Creator Tool : LaTeX with hyperref package
[XMP-xmpMM] Document ID : uuid:d9c82353-5d6d-11f3-0000-9737b4daf3fd
[XMP-dc] Format : application/pdf
[XMP-dc] Title : Production of PDF/A-Compliant Documents
[XMP-dc] Creator :
[XMP-dc] Description : This is a sample document.
[XMP-pdfaid] Part : 1
[XMP-pdfaid] Conformance : B
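To see at a glance which XMP keys were lost, the two listings above can be compared mechanically. The following is a small sketch that works on saved exiftool -a -G1 output; the function name missing_xmp_keys is made up, and it assumes that tag names never contain a colon.

```shell
#!/bin/sh
# Hypothetical helper: print every "[XMP-group] Tag Name" that appears in the
# listing of the original file ($1) but not in the listing of the converted
# file ($2). Both arguments are text files holding `exiftool -a -G1` output.
missing_xmp_keys() {
    awk -F: '
        /^\[XMP-/ {
            key = $1                      # everything before the first colon
            gsub(/[ \t]+/, " ", key)      # collapse the column padding
            sub(/ $/, "", key)            # drop the trailing space
            if (pass == "conv") {
                in_conv[key] = 1
            } else if (!(key in in_conv) && !(key in printed)) {
                print key
                printed[key] = 1
            }
        }
    ' pass=conv "$2" pass=orig "$1"
}

# Usage (requires exiftool to produce the listings):
#   exiftool -a -G1 document.pdf      > orig.txt
#   exiftool -a -G1 document-PDFA.pdf > conv.txt
#   missing_xmp_keys orig.txt conv.txt
```

Only the keys are compared, not the values, so a tag that survives with an empty value (such as the XMP-dc Creator above) is not reported.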
Another goal is to try to preserve TeX's custom PDF metadata in the Info dictionary, as in this case:
/GTS_PDFA1Version (PDF/A-1b:2005)
/PTEX.Fullbanner (This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017) kpathsea version 6.2.3)
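One possible route for these entries is the DOCINFO pdfmark, which sets keys in the Info dictionary. The sketch below only generates the pdfmark lines; the helper names are invented, and whether Ghostscript keeps custom (non-standard) DOCINFO keys when writing PDF/A is an assumption that would need to be tested.

```shell
#!/bin/sh
# Hypothetical helpers: build DOCINFO pdfmark lines for custom Info entries.

# Escape backslashes and parentheses so the value is a safe PostScript string.
pdfmark_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/(/\\(/g' -e 's/)/\\)/g'
}

# $1 = Info dictionary key (without the leading slash), $2 = value
docinfo_pdfmark() {
    printf '[ /%s (%s) /DOCINFO pdfmark\n' "$1" "$(pdfmark_escape "$2")"
}

# Usage:
#   docinfo_pdfmark GTS_PDFA1Version 'PDF/A-1b:2005' > custom-info.ps
```

The generated lines could then be handed to Ghostscript together with the other PostScript definitions used for the conversion.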