PDF driver adds trailer/EOF instructions that make files PDF/X invalid
What is the bug?
I am working on CMYK support for QGIS and trying to generate valid PDF/X-4 files (PDF/X-4 is a print-ready PDF specification).
QGIS uses GDAL to update PDF metadata (projection, author, creation date...), but when it does, it modifies the file in such a way that the generated PDF is no longer valid PDF/X-4.
Steps to reproduce the issue
- Download test_cmyk_no_gdal.pdf and copy it to /tmp
- Launch hexedit /tmp/test_cmyk_no_gdal.pdf and search for "ID [". You read:
trailer.<<./Size 21 ./Info 1 0 R./Root 6 0 R./ID [
<61323935376139312d653735322d346338632d616361352d663435303037633833333465> <61323935376139312d653735322d346338632d616361352d663435303037633833333465> ].
>>.
startxref.
2676923 .
%%EOF.
- Launch the following GDAL Python program:
from osgeo import gdal
ds = gdal.Open("/tmp/test_cmyk_no_gdal.pdf", gdal.GA_Update)
ds.SetMetadataItem("AUTHOR", "Julien")
ds = None  # close the dataset so the update is flushed to disk
- Launch hexedit /tmp/test_cmyk_no_gdal.pdf and search for "ID [" again. You now read:
.trailer.<<./Size 21 ./Info 1 0 R./Root 6 0 R./ID [
<61323935376139312d653735322d346338632d616361352d663435303037633833333465> <61323935376139312d653735322d346338632d616361352d663435303037633833333465> ].
>>.
startxref.
2676923 .
%%EOF.
1 0 obj.
<< /Author (Julien) /CreationDate (D:20240718093827+02'00') /Producer (Qt 6.8.0) /Title (/home/julien/Nextcloud/Temp/test_cmyk.pdf) >>.
endobj.
xref.
0 1.
0000000000 65535 f .
1 1.
0002677584 00000 n .
trailer.
<< /Info 1 0 R /Prev 2676923 /Root 6 0 R /Size 21 >>.
startxref.
2677734.
%%EOF.
The Prev 2676923 instruction seems to reference the previous trailer, so it might be fine (though I don't know much about the PDF specification), but we have 2 EOF instructions and I don't think it's OK.
I used the preflight tool ("prépresse" in French) from Adobe Acrobat Reader pro to check whether the generated files are valid PDF/X-4, and the GDAL-modified one gets an extra error: "Absence d'ID du document" (Missing document ID).
Before GDAL
After GDAL
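The same comparison can be done without a hex editor. Below is a minimal sketch, assuming classic (non-stream) trailers whose dictionaries contain no nested << >> and no binary ID strings, which is the case for the dumps above but not for every PDF. The function names are hypothetical; this only checks for the presence of /ID and is not a PDF/X-4 validator.

```python
import re

# Matches each classic "trailer << ... >>" dictionary.
# Assumption: no nested << >> inside the trailer dictionary.
TRAILER_RE = re.compile(rb"trailer\s*<<(.*?)>>", re.DOTALL)
ID_RE = re.compile(rb"/ID\s*\[")


def trailers_with_id_flag(data: bytes) -> list[tuple[int, bool]]:
    """Return (byte offset, has_id) for every trailer dictionary found."""
    return [
        (m.start(), ID_RE.search(m.group(1)) is not None)
        for m in TRAILER_RE.finditer(data)
    ]


def last_trailer_has_id(data: bytes) -> bool:
    """True if the final trailer (the one a reader starts from) has /ID."""
    flags = trailers_with_id_flag(data)
    return bool(flags) and flags[-1][1]


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        content = f.read()
    for offset, has_id in trailers_with_id_flag(content):
        print(f"trailer at byte {offset}: /ID {'present' if has_id else 'MISSING'}")
```

Run against the file before and after the GDAL update, this should report one trailer with /ID, then two trailers with the second one missing it.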
Versions and provenance
- Debian testing
- python3-gdal debian package 3.8.5+dfsg-1+b1
Additional context
Just a side note: GDAL doesn't manage XMP metadata consistency. This means that if I change the metadata item CREATION_DATE, the related XMP metadata entry is not updated accordingly, and so the PDF is not valid PDF/X-4.
It looks like GDAL doesn't intend to ensure this consistency, and I plan to do it in QGIS, so no extra issue here. But please correct me if I'm wrong and you think it should be fixed in GDAL.
but we have 2 EOF instructions and I don't think it's OK.
At least, for regular PDFs, that's fine. The PDF spec (version 1.7) mentions at page 99: "a file that has been updated several times contains several trailers; each trailer is terminated by its own end-of-file (%%EOF ) marker"
I suspect that PDF/X-4 has stronger requirements than the base PDF spec. PDF is a super complicated format, and GDAL mostly does it "at hand" (at least on the writing side). I've no idea what supporting PDF/X-4 would involve. Perhaps the standard update procedure doesn't work for PDF/X-4, and you need to generate a new file, actually updating the original objects instead of appending the updates?
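To illustrate the append-style update described in the spec quote, here is a small sketch that counts the %%EOF markers and walks the chain of xref sections from the last startxref back through each /Prev entry. It assumes classic xref tables followed by a "trailer << ... >>" dictionary, as in the dump above; cross-reference streams are not handled, and the function names are hypothetical.

```python
import re

TRAILER_RE = re.compile(rb"trailer\s*<<(.*?)>>", re.DOTALL)
PREV_RE = re.compile(rb"/Prev\s+(\d+)")
STARTXREF_RE = re.compile(rb"startxref\s+(\d+)")


def count_eof_markers(data: bytes) -> int:
    """Number of %%EOF markers, i.e. one per revision appended to the file."""
    return data.count(b"%%EOF")


def revision_offsets(data: bytes) -> list[int]:
    """Byte offsets of each xref section, newest first, following /Prev."""
    starts = STARTXREF_RE.findall(data)
    if not starts:
        return []
    offsets = [int(starts[-1])]  # a reader begins with the last startxref
    seen = set(offsets)
    while True:
        # The trailer dictionary follows the xref section it belongs to.
        trailer = TRAILER_RE.search(data, offsets[-1])
        prev = PREV_RE.search(trailer.group(1)) if trailer else None
        if prev is None or int(prev.group(1)) in seen:
            return offsets  # no older revision, or a loop in a broken file
        offsets.append(int(prev.group(1)))
        seen.add(offsets[-1])
```

For the updated file shown in the report, this would be expected to return the offsets 2677734 and then 2676923, matching the two startxref values in the dump.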
GDAL doesn't manage XMP metadata consistency.
"obviously" not :-)
@troopa81 you might have more luck using the pdf composition XML file approach (which is used in QGIS for geopdf exports) to generate a completely new pdf from the input one
you might have more luck using the pdf composition XML file approach (which is used in QGIS for geopdf exports) to generate a completely new pdf from the input one
indeed. But that will just generate a new regular PDF file, not a PDF/X-4.
At least, for regular PDFs, that's fine. The PDF spec (version 1.7) mentions at page 99: "a file that has been updated several times contains several trailers; each trailer is terminated by its own end-of-file (%%EOF ) marker"
OK, so "maybe" this is because the second trailer lacks an ID instruction? I'll try to paste the first one into the second trailer to check whether it complies.
you might have more luck using the pdf composition XML file approach (which is used in QGIS for geopdf exports) to generate a completely new pdf from the input one
indeed. But that will just generate a new regular PDF file, not a PDF/X-4.
Yes, I don't know whether it's feasible to have a GeoPDF that also complies with the PDF/X-4 format. That would require adding the embedded ICC profile to the XML composition file (but I know little about the way GeoPDFs are exported).
OK, so "maybe" this is because the second trailer lacks an ID instruction? I'll try to paste the first one into the second trailer to check whether it complies.
I confirm that the issue comes from the missing ID in the second trailer. If I just copy/paste the ID from the previous trailer, it complies.
From the PDF/X-4 specification, I only read
The ID key in the file trailer shall be present.
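The manual copy/paste test described above can be scripted. This is only a sketch of that workaround, not the fix GDAL would apply: it copies the /ID array from the most recent earlier trailer into the last trailer dictionary. It relies on the inserted bytes landing after the last xref table, so no byte offset recorded in the file should change. It assumes classic trailers with no nested << >> and hex-string IDs; the function name is hypothetical.

```python
import re

TRAILER_RE = re.compile(rb"trailer\s*<<(.*?)>>", re.DOTALL)
# Assumption: the ID array holds hex strings, so it contains no "]" byte.
ID_RE = re.compile(rb"/ID\s*\[.*?\]", re.DOTALL)


def copy_id_to_last_trailer(data: bytes) -> bytes:
    """Return data with /ID copied into the last trailer if it lacks one."""
    trailers = list(TRAILER_RE.finditer(data))
    if not trailers:
        return data
    last = trailers[-1]
    if ID_RE.search(last.group(1)):
        return data  # the last trailer already has an /ID
    # Use the /ID of the most recent earlier trailer that has one.
    for earlier in reversed(trailers[:-1]):
        found = ID_RE.search(earlier.group(1))
        if found:
            insert_at = last.end(1)  # position just before the closing ">>"
            return (data[:insert_at] + b" " + found.group(0) + b" "
                    + data[insert_at:])
    return data  # no /ID anywhere, nothing to copy


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as f:
        patched = copy_id_to_last_trailer(f.read())
    with open(sys.argv[2], "wb") as f:
        f.write(patched)
```

The output file still needs to be checked with a preflight tool; the script only reproduces the manual edit.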
I am trying to find a fix in GDAL. I understand that the Info trailer is set to be updated when we modify the metadata, which leads to this comment. IIRC podofo is in charge of updating/fixing the file on write and so would be the culprit here. But I'm unsure, because the GDAL documentation states that no dependency is used on write.
I am trying to find a fix in GDAL. I understand that the Info trailer is set to be updated when we modify the metadata, which leads to this comment. IIRC podofo is in charge of updating/fixing the file on write and so would be the culprit here. But I'm unsure, because the GDAL documentation states that no dependency is used on write.
@troopa81 Update is a bit of a mix. Poppler or Podofo are used to build the existing PDF object hierarchy, but update/writing is done "at hand" in GDALPDFBaseWriter::WriteXRefTableAndTrailer() in frmts/pdf/pdfcreatecopy.cpp