Add option to not change the PDF CreationDate for reproducibility
Cleaning an already cleaned PDF produces a file that differs from the first cleaned output, which should not happen if cleaning is meant to be reproducible.
Steps to reproduce
$ wget https://freetestdata.com/wp-content/uploads/2021/09/Free_Test_Data_100KB_PDF.pdf -O test.pdf
$ mat2 test.pdf
$ mat2 test.cleaned.pdf
$ diffoscope test.cleaned.pdf test.cleaned.cleaned.pdf
--- test.cleaned.pdf
+++ test.cleaned.cleaned.pdf
│ --- test.cleaned.pdf
├── +++ test.cleaned.cleaned.pdf
│┄ Document info
│ @@ -1,2 +1,2 @@
│ -CreationDate: "D:20251026082356+01'00"
│ +CreationDate: "D:20251026082450+01'00"
│ Producer: 'cairo 1.18.4 (https://cairographics.org)'
├── dumppdf -at {}
│ @@ -523,24 +523,24 @@
│ <value><number>6</number></value>
│ <key>First</key>
│ <value><number>37</number></value>
│ <key>Filter</key>
│ <value><literal>FlateDecode</literal></value>
│ </dict>
│ </props>
│ -<data size="912">1 0 36 64 2 165 13 388 24 613 37 838 << /Type /Pages /Kids [ 2 0 R 13 0 R 24 0 R ] /Count 3 >> << /Producer (cairo 1.18.4 (https://cairographics.org)) /CreationDate (D:20251026082356+01'00) >> << /Type /Page % 1 /Parent 1 0 R /MediaBox [ 0 0 612 792 ] /Contents 4 0 R /Group << /Type /Group /S /Transparency /I true /CS /DeviceRGB >> /Resources 3 0 R /StructParents 0 >> << /Type /Page % 2 /Parent 1 0 R /MediaBox [ 0 0 612 792 ] /Contents 15 0 R /Group << /Type /Group /S /Transparency /I true /CS /DeviceRGB >> /Resources 14 0 R /StructParents 1 >> << /Type /Page % 3 /Parent 1 0 R /MediaBox [ 0 0 612 792 ] /Contents 26 0 R /Group << /Type /Group /S /Transparency /I true /CS /DeviceRGB >> /Resources 25 0 R /StructParents 2 >> << /Type /Catalog /Pages 1 0 R >> </data>
│ +<data size="912">1 0 36 64 2 165 13 388 24 613 37 838 << /Type /Pages /Kids [ 2 0 R 13 0 R 24 0 R ] /Count 3 >> << /Producer (cairo 1.18.4 (https://cairographics.org)) /CreationDate (D:20251026082450+01'00) >> << /Type /Page % 1 /Parent 1 0 R /MediaBox [ 0 0 612 792 ] /Contents 4 0 R /Group << /Type /Group /S /Transparency /I true /CS /DeviceRGB >> /Resources 3 0 R /StructParents 0 >> << /Type /Page % 2 /Parent 1 0 R /MediaBox [ 0 0 612 792 ] /Contents 15 0 R /Group << /Type /Group /S /Transparency /I true /CS /DeviceRGB >> /Resources 14 0 R /StructParents 1 >> << /Type /Page % 3 /Parent 1 0 R /MediaBox [ 0 0 612 792 ] /Contents 26 0 R /Group << /Type /Group /S /Transparency /I true /CS /DeviceRGB >> /Resources 25 0 R /StructParents 2 >> << /Type /Catalog /Pages 1 0 R >> </data>
│ </stream>
│ </object>
│
│ <object id="36">
│ <dict size="2">
│ <key>Producer</key>
│ <value><string size="40">cairo 1.18.4 (https://cairographics.org)</string></value>
│ <key>CreationDate</key>
│ -<value><string size="22">D:20251026082356+01'00</string></value>
│ +<value><string size="22">D:20251026082450+01'00</string></value>
│ </dict>
│ </object>
│
│ <object id="37">
│ <dict size="2">
│ <key>Type</key>
│ <value><literal>Catalog</literal></value>
So mat2 changes the CreationDate. It would be great to have at least an option to leave the CreationDate untouched to ensure reproducibility, maybe with a --reproducible parameter?
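To make the observation concrete, here is a minimal Python sketch that compares the two info dictionaries from the diff above while ignoring the CreationDate value. The helper names (mask_creation_date, same_ignoring_creation_date) and the D:MASKED placeholder are made up for illustration and are not part of mat2. It only works on the already decoded dictionary text; in the real file this dictionary sits inside a FlateDecode object stream, so a real tool would need a PDF library to reach it.

```python
import re

# Matches a CreationDate entry such as: /CreationDate (D:20251026082356+01'00)
CREATION_DATE = re.compile(r"/CreationDate\s*\(D:[^)]*\)")


def mask_creation_date(info_dict: str) -> str:
    """Replace the CreationDate value with a fixed placeholder (hypothetical helper)."""
    return CREATION_DATE.sub("/CreationDate (D:MASKED)", info_dict)


def same_ignoring_creation_date(first: str, second: str) -> bool:
    """Return True if two info dictionaries differ at most in CreationDate."""
    return mask_creation_date(first) == mask_creation_date(second)


# Info dictionaries copied from the two runs shown in the diff.
first_run = (
    "<< /Producer (cairo 1.18.4 (https://cairographics.org)) "
    "/CreationDate (D:20251026082356+01'00) >>"
)
second_run = (
    "<< /Producer (cairo 1.18.4 (https://cairographics.org)) "
    "/CreationDate (D:20251026082450+01'00) >>"
)

print(first_run == second_run)                             # False
print(same_ignoring_creation_date(first_run, second_run))  # True
```

For this sample file, the timestamp is the only difference in the info dictionary between the two runs.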
This isn't possible, as determining whether a PDF is free of metadata is borderline intractable. mat2 unconditionally rasterizes the PDF into pictures, then assembles the results into a new PDF.
If you want reproducibility, you can use the --lightweight option when dealing with PDFs, as this will only remove superficial metadata and should thus be reproducible.
Nevertheless, what's your use case for wanting idempotency in mat2 processing?
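For readers unfamiliar with the term, idempotency here means that cleaning an already cleaned file yields the same bytes again. Below is a minimal Python sketch of that check. Both cleaners are toy stand-ins invented for illustration, not mat2 code: one embeds a value that changes on every run (standing in for a fresh timestamp), the other always writes the same value.

```python
from typing import Callable


def is_idempotent(clean: Callable[[bytes], bytes], data: bytes) -> bool:
    """Return True if cleaning already-cleaned data leaves it unchanged."""
    once = clean(data)
    twice = clean(once)
    return once == twice


class StampingCleaner:
    """Toy cleaner: appends a run counter that stands in for a fresh timestamp."""

    def __init__(self) -> None:
        self.runs = 0

    def __call__(self, data: bytes) -> bytes:
        self.runs += 1
        body = data.split(b"|stamp=")[0]  # drop any stamp from a previous run
        return body + b"|stamp=" + str(self.runs).encode()


def fixed_cleaner(data: bytes) -> bytes:
    """Toy cleaner: always writes the same stamp, so repeated runs agree."""
    return data.split(b"|stamp=")[0] + b"|stamp=0"


print(is_idempotent(StampingCleaner(), b"content"))  # False
print(is_idempotent(fixed_cleaner, b"content"))      # True
```

The same comparison can be done on real files by hashing the output of two consecutive runs.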
Interesting, I wasn't aware of the --lightweight option. However, using this option results in a huge diff, cut off here for better readability:
$ wget https://freetestdata.com/wp-content/uploads/2021/09/Free_Test_Data_100KB_PDF.pdf -O test.pdf
$ mat2 --lightweight test.pdf
$ mat2 --lightweight test.cleaned.pdf
$ diffoscope test.cleaned.pdf test.cleaned.cleaned.pdf
--- test.cleaned.pdf
+++ test.cleaned.cleaned.pdf
│ --- test.cleaned.pdf
├── +++ test.cleaned.cleaned.pdf
│┄ Document info
│ @@ -1,2 +1,2 @@
│ -CreationDate: "D:20251026153817+01'00"
│ +CreationDate: "D:20251026153821+01'00"
│ Producer: 'cairo 1.18.4 (https://cairographics.org)'
├── dumppdf -at {}
│ @@ -60,15 +60,15 @@
│ </dict></value>
│ <key>Font</key>
│ <value><dict size="3">
│ <key>f-0-0</key>
│ <value><ref id="7" /></value>
│ <key>f-1-1</key>
│ <value><ref id="8" /></value>
│ -<key>f-1-0</key>
│ +<key>f-2-0</key>
│ <value><ref id="9" /></value>
│ </dict></value>
│ </dict>
│ </object>
│
│ <object id="4">
│ <stream>
│ @@ -76,35 +76,35 @@
│ <dict size="2">
│ <key>Length</key>
│ <value><ref id="5" /></value>
│ <key>Filter</key>
│ <value><literal>FlateDecode</literal></value>
│ </dict>
│ </props>
│ -<data size="14631">1 0 0 -1 0 792 cm q 1 1 1 rg /a0 gs 70.586 71.996 470.949 14.305 re f* 0 0 0 rg BT 10.56 0 0 -10.56 72.024 83.18 Tm /f-0-0 1 Tf [(L)-3(ore)16(m )-173(ips)13(um)10( )-171(dol)3(or)13( )-172(sit)3( )-159 (amet,)10( )-172(co)13(nse)3(cte)5(t)1
...
My use case is generating PDFs (i.e. merging 2 PDFs with pdftk) and tracking them in git, where I only want to commit content changes.