PDF driver adds trailer/EOF instructions that make files PDF/X invalid

Open troopa81 opened this issue 1 year ago • 6 comments

What is the bug?

I am working on CMYK support for QGIS and am trying to generate valid PDF/X-4 files (PDF/X-4 is a print-ready PDF specification).

QGIS uses GDAL to update PDF metadata (projection, author, creation date, ...), but when it does, GDAL modifies the file in such a way that the generated PDF is no longer PDF/X-4 valid.

Steps to reproduce the issue

  • Download test_cmyk_no_gdal.pdf and copy it to /tmp
  • Launch hexedit /tmp/test_cmyk_no_gdal.pdf and search for ID [. You should read:
trailer.<<./Size 21 ./Info 1 0 R./Root 6 0 R./ID [
<61323935376139312d653735322d346338632d616361352d663435303037633833333465> <61323935376139312d653735322d346338632d616361352d663435303037633833333465> ].
>>.
startxref.
2676923 .
%%EOF.
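  • Run the following GDAL Python program:
from osgeo import gdal

# Open in update mode so the metadata change is written back to the file.
ds = gdal.Open("/tmp/test_cmyk_no_gdal.pdf", gdal.GA_Update)
ds.SetMetadataItem("AUTHOR", "Julien")
ds = None  # Close the dataset so the update is flushed to disk.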
  • Launch hexedit /tmp/test_cmyk_no_gdal.pdf again and search for ID [. You now read:
.trailer.<<./Size 21 ./Info 1 0 R./Root 6 0 R./ID [
<61323935376139312d653735322d346338632d616361352d663435303037633833333465> <61323935376139312d653735322d346338632d616361352d663435303037633833333465> ].
>>.
startxref.
2676923 .
%%EOF.
1 0 obj.
<< /Author (Julien) /CreationDate (D:20240718093827+02'00') /Producer (Qt 6.8.0) /Title (/home/julien/Nextcloud/Temp/test_cmyk.pdf) >>.
endobj.
xref.
0 1.
0000000000 65535 f .
1 1.
0002677584 00000 n .
trailer.
<< /Info 1 0 R /Prev 2676923 /Root 6 0 R /Size 21 >>.
startxref.
2677734.
%%EOF.
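
The before/after comparison in the steps above can also be scripted instead of read off a hex dump. This is only a minimal sketch, assuming the file uses classic "trailer" keywords as in the dumps (not cross-reference streams). The helper name last_trailer_has_id is hypothetical, and a plain byte search like this can be fooled if the word "trailer" also appears inside a stream or a string.

```python
import re


def last_trailer_has_id(pdf_bytes: bytes) -> bool:
    """Return True if the last classic trailer dictionary contains an /ID array.

    Assumption: the file ends with a classic 'trailer << ... >> startxref'
    block, as in the hex dumps of this issue.
    """
    pos = pdf_bytes.rfind(b"trailer")
    if pos == -1:
        raise ValueError("no classic 'trailer' keyword found")
    # The trailer dictionary sits between 'trailer' and the next 'startxref'.
    end = pdf_bytes.find(b"startxref", pos)
    if end == -1:
        end = len(pdf_bytes)
    return re.search(rb"/ID\s*\[", pdf_bytes[pos:end]) is not None


if __name__ == "__main__":
    with open("/tmp/test_cmyk_no_gdal.pdf", "rb") as f:
        print(last_trailer_has_id(f.read()))
```

Run once before and once after the GDAL program: the result is expected to change from True to False if the appended trailer has no ID entry.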

The /Prev 2676923 entry seems to point back to the previous cross-reference section (it matches the first startxref value), so it is probably fine (though I don't know much about the PDF specification), but we now have two %%EOF markers and I don't think that's OK.

I used the preflight tool (prépresse in French) from Adobe Acrobat Reader Pro to check whether the generated files are PDF/X-4 valid, and the GDAL-modified one gets an extra error: "Absence d'ID du document" (missing document ID).

Before GDAL: [screenshot: sans_gdal]

After GDAL: [screenshot: avec_gdal]

Versions and provenance

  • Debian testing
  • python3-gdal debian package 3.8.5+dfsg-1+b1

Additional context

As a side note, GDAL doesn't manage XMP metadata consistency: if I change the CREATION_DATE metadata item, the corresponding XMP metadata entry is not updated accordingly, so the PDF is no longer PDF/X-4 valid.

It looks like GDAL doesn't intend to ensure this consistency, and I plan to handle it in QGIS, so no extra issue is needed here. But please correct me if I'm wrong and you think it should be fixed in GDAL.
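
To illustrate what keeping the two in sync involves: the Info dictionary stores dates in the PDF form D:YYYYMMDDHHmmSS+HH'mm', while the XMP packet stores the matching date (xmp:CreateDate) in an ISO 8601 style. The sketch below only shows the conversion under those assumptions. The function name pdf_date_to_xmp is hypothetical, it only handles full-precision dates like the one in the dump above, and it is not how QGIS or GDAL do it.

```python
import re

# Full-precision PDF date, optionally followed by 'Z' or a +HH'mm' offset.
PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})"
    r"(?:(Z)|([+-])(\d{2})'(\d{2})'?)?"
)


def pdf_date_to_xmp(pdf_date: str) -> str:
    """Convert a PDF date string to the ISO 8601 style used in XMP."""
    m = PDF_DATE.fullmatch(pdf_date)
    if m is None:
        raise ValueError(f"unsupported PDF date: {pdf_date!r}")
    year, month, day, hour, minute, second, zulu, sign, tz_h, tz_m = m.groups()
    stamp = f"{year}-{month}-{day}T{hour}:{minute}:{second}"
    if zulu:
        return stamp + "Z"
    if sign:
        return f"{stamp}{sign}{tz_h}:{tz_m}"
    return stamp  # no time zone information in the source date


if __name__ == "__main__":
    print(pdf_date_to_xmp("D:20240718093827+02'00'"))  # 2024-07-18T09:38:27+02:00
```

A consistency check would then compare this value with the date found in the XMP packet after each metadata update.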

troopa81 • Jul 18 '24 08:07

> but we now have two %%EOF markers and I don't think that's OK.

At least for regular PDFs, that's fine. The PDF spec (version 1.7) states on page 99: "a file that has been updated several times contains several trailers; each trailer is terminated by its own end-of-file (%%EOF) marker"
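
The incremental-update layout described in that quote can be made visible with a few lines of Python. This is a rough sketch, assuming classic "startxref ... %%EOF" blocks as in the dumps above; list_update_sections is a hypothetical helper name, and the byte-level regular expression could in principle also match inside stream data.

```python
import re


def list_update_sections(pdf_bytes: bytes) -> list[int]:
    """Return the startxref offset of every section, in file order.

    Each incremental update appends its own 'startxref <offset> %%EOF'
    block, so the length of the result is the number of sections.
    """
    pattern = re.compile(rb"startxref\s+(\d+)\s+%%EOF")
    return [int(m.group(1)) for m in pattern.finditer(pdf_bytes)]


if __name__ == "__main__":
    with open("/tmp/test_cmyk_no_gdal.pdf", "rb") as f:
        print(list_update_sections(f.read()))
```

On the GDAL-modified file from the report this should list two offsets, 2676923 and 2677734, the first of which is the value carried by /Prev in the appended trailer.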

I suspect that PDF/X-4 has stricter requirements than the base PDF spec. PDF is a super complicated format, and GDAL mostly does it "by hand" (at least on the writing side). I have no idea what supporting PDF/X-4 would involve. Perhaps the standard update procedure doesn't work for PDF/X-4, and you need to generate a new file that actually updates the original objects instead of appending the updates?

> GDAL doesn't manage XMP metadata consistency.

"obviously" not :-)

rouault • Jul 18 '24 18:07

@troopa81 you might have more luck using the pdf composition XML file approach (which is used in QGIS for geopdf exports) to generate a completely new pdf from the input one

nyalldawson • Jul 18 '24 21:07

> you might have more luck using the pdf composition XML file approach (which is used in QGIS for geopdf exports) to generate a completely new pdf from the input one

indeed. But that will just generate a new regular PDF file, not a PDF/X-4.

rouault • Jul 18 '24 22:07

> At least for regular PDFs, that's fine. The PDF spec (version 1.7) states on page 99: "a file that has been updated several times contains several trailers; each trailer is terminated by its own end-of-file (%%EOF) marker"

OK, so "maybe" this is because the second trailer lacks an ID entry? I'll try pasting the first one into the second trailer to check whether it then complies.

> you might have more luck using the pdf composition XML file approach (which is used in QGIS for geopdf exports) to generate a completely new pdf from the input one

> indeed. But that will just generate a new regular PDF file, not a PDF/X-4.

Yes, I don't know whether it's feasible to have a GeoPDF that also complies with the PDF/X-4 format. That would require adding the embedded ICC profile to the XML composition file (but I know little about the way GeoPDFs are exported).

troopa81 • Jul 19 '24 07:07

> OK, so "maybe" this is because the second trailer lacks an ID entry? I'll try pasting the first one into the second trailer to check whether it then complies.

I confirm that the issue comes from the missing ID in the second trailer. If I just copy/paste the ID from the previous trailer, it complies.
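
The manual copy/paste can be reproduced with a small script, for example to test other files. This is a workaround sketch under several assumptions, not a fix for GDAL: the file uses classic trailers, the last trailer dictionary is flat (no nested << >>), and the ID array contains hex strings as in the dump. copy_id_to_last_trailer is a hypothetical helper name.

```python
import re


def copy_id_to_last_trailer(pdf_bytes: bytes) -> bytes:
    """Copy the most recent /ID array into the last trailer if it has none.

    Bytes are only inserted after the last 'xref' keyword, so the offsets
    stored in the xref tables and in startxref stay valid.
    """
    ids = re.findall(rb"/ID\s*\[[^\]]*\]", pdf_bytes)
    last = pdf_bytes.rfind(b"trailer")
    if not ids or last == -1:
        raise ValueError("no /ID array or no classic trailer found")
    # Assumption: the first '>>' after 'trailer' closes the trailer dictionary.
    close = pdf_bytes.find(b">>", last)
    if close == -1:
        raise ValueError("unterminated trailer dictionary")
    if re.search(rb"/ID\s*\[", pdf_bytes[last:close]):
        return pdf_bytes  # the last trailer already has an ID
    return pdf_bytes[:close] + ids[-1] + b" " + pdf_bytes[close:]


if __name__ == "__main__":
    path = "/tmp/test_cmyk_no_gdal.pdf"
    with open(path, "rb") as f:
        patched = copy_id_to_last_trailer(f.read())
    with open(path, "wb") as f:
        f.write(patched)
```

The result should still be checked with a preflight tool, since this only addresses the missing ID entry.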

From the PDF/X-4 specification, I only read:

> The ID key in the file trailer shall be present.

I'm trying to find a fix in GDAL. I understand that the Info trailer is set to be updated when we modify the metadata, which leads to this comment. IIRC, PoDoFo is in charge of updating/fixing the file on write and so would be the culprit here. But I'm unsure, because the GDAL documentation states that no dependency is used on write.

troopa81 • Aug 20 '24 09:08

> I'm trying to find a fix in GDAL. I understand that the Info trailer is set to be updated when we modify the metadata, which leads to this comment. IIRC, PoDoFo is in charge of updating/fixing the file on write and so would be the culprit here. But I'm unsure, because the GDAL documentation states that no dependency is used on write.

@troopa81 Update is a bit of a mix. Poppler or PoDoFo is used to build the existing PDF object hierarchy, but updating/writing is done "by hand" in GDALPDFBaseWriter::WriteXRefTableAndTrailer() in frmts/pdf/pdfcreatecopy.cpp

rouault • Aug 26 '24 16:08