Reading incremental PDFs returns wrong (first) object per object number
I use pypdf to open a PDF file that contains attachments and save it to a new file. The new file does not contain the attachments.
Environment
$ python -m platform
Linux-6.14.9-orbstack-gd9e87d038362-aarch64-with-glibc2.36
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.6.0, crypt_provider=('cryptography', '42.0.8'), PIL=10.4.0
Code + PDF
This is a minimal, complete example that shows the issue:
import io

from pypdf import PdfWriter

# clone_from expects a path, stream, or PdfReader, so wrap the raw bytes.
with open("/opt/app/file.pdf", "rb") as f:
    pdf_base = PdfWriter(clone_from=io.BytesIO(f.read()))

# Write the cloned document to an in-memory buffer.
with io.BytesIO() as output_iostream:
    pdf_base.write(output_iostream)
    pdf_bytes = output_iostream.getvalue()

with open("/opt/output.pdf", "wb") as f:
    f.write(pdf_bytes)
This is my sample PDF file that shows the issue: original.pdf
Problem
The output PDF does not contain the attachments.
Expected Result
The output PDF should contain the attachments.
Thanks for the report.
This is not related to the writer. It is a general issue with how this PDF file is processed, as the reader alone already fails to detect the attachments:
from pypdf import PdfReader
reader = PdfReader("missing_attachment.pdf")
print(list(reader.attachment_list))
The catalog looks okay:
{'/Type': '/Catalog', '/Version': '/1.7', '/Pages': IndirectObject(2, 0, 140080531858448), '/Metadata': IndirectObject(21, 0, 140080531858448), '/Outlines': IndirectObject(22, 0, 140080531858448), '/Names': IndirectObject(26, 0, 140080531858448)}
The /Names entry for the embedded files points to object 26, which is defined three times in the document because the PDF has been updated incrementally. We currently only read the first definition, which is a cross-reference stream dictionary:
26 0 obj
<<
/Length 76
/Root 1 0 R
/Info 24 0 R
/ID [<BE09E22BADBF38EEDC41DE6FC461296B> <BE09E22BADBF38EEDC41DE6FC461296B>]
/Type /XRef
/Size 27
/Index [0 26]
/W [1 2 1]
/Filter /FlateDecode
>>
...
endobj
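As background on the object above, here is a minimal sketch of how the entries of such a cross-reference stream are laid out. It assumes the stream data has already been decompressed (FlateDecode) and carries no predictor; `decode_xref_entries` is a hypothetical helper for illustration, not part of pypdf. /W [1 2 1] gives the byte width of the three fields of each entry, and /Index [0 26] says the entries cover object numbers 0 to 25, so this stream does not list object 26 itself.

```python
def decode_xref_entries(data: bytes, widths: list[int], index: list[int]) -> dict[int, tuple[int, ...]]:
    """Split decoded xref stream data into (type, field2, field3) per object number."""
    entries = {}
    pos = 0
    # /Index holds pairs: first object number, number of entries.
    for first, count in zip(index[0::2], index[1::2]):
        for obj_num in range(first, first + count):
            fields = []
            for field_no, width in enumerate(widths):
                if width == 0:
                    # A missing type field defaults to 1, other fields to 0.
                    fields.append(1 if field_no == 0 else 0)
                else:
                    fields.append(int.from_bytes(data[pos:pos + width], "big"))
                pos += width
            entries[obj_num] = tuple(fields)
    return entries
```

For a type 1 entry the second field is the byte offset of the object and the third is its generation number.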
We should read the second one instead:
26 0 obj
<<
/EmbeddedFiles 27 0 R
>>
endobj
However, there is a third definition of the same object, which is now a stream (the /Length1 entry suggests an embedded font program rather than a page content stream):
26 0 obj
<<
/Length 24006
/Filter /FlateDecode
/Length1 52696
>>
stream
...
endstream
endobj
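The three definitions shown above can be located with a naive scan of the raw file bytes. This is only a diagnostic sketch: `find_object_definitions` is a hypothetical helper, not a pypdf API, and a plain byte search could in principle also match inside binary stream data, so any hit should be checked by hand.

```python
import re


def find_object_definitions(data: bytes, obj_num: int, generation: int = 0) -> list[int]:
    """Return the byte offset of every 'N G obj' header found in raw PDF data."""
    # (?<!\d) prevents '126 0 obj' from matching a search for object 26.
    pattern = re.compile(rb"(?<!\d)%d\s+%d\s+obj\b" % (obj_num, generation))
    return [match.start() for match in pattern.finditer(data)]
```

Reading the sample file in binary mode and calling `find_object_definitions(data, 26)` gives the offsets at which each candidate definition starts, which makes it easier to compare them against the offsets recorded in the cross-reference sections.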
As far as I understand section 7.5.6 of the PDF 2.0 specification, if an object is defined more than once we should read the last definition, not the first. This would not solve your issue with the attachments, as it would pick the third version with the stream, but it would at least satisfy the specification.
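The "newest definition wins" rule can be sketched as follows. In an incrementally updated file, each update appends its own cross-reference section whose trailer points to the previous one via /Prev, and a reader starts from the last section in the file. The `XrefSection` type and `resolve_object_offsets` function below are hypothetical names for illustration only. This is not how pypdf is implemented, and it ignores free entries, generation numbers, and object streams.

```python
from typing import NamedTuple, Optional


class XrefSection(NamedTuple):
    entries: dict[int, int]  # object number -> byte offset of the object
    prev: Optional[int]      # offset of the previous xref section (/Prev), if any


def resolve_object_offsets(sections: dict[int, XrefSection], startxref: int) -> dict[int, int]:
    """Walk the /Prev chain from the newest section; the first hit per object wins."""
    resolved: dict[int, int] = {}
    visited: set[int] = set()
    pos: Optional[int] = startxref
    while pos is not None and pos not in visited:
        visited.add(pos)  # guard against circular /Prev chains
        section = sections[pos]
        for obj_num, offset in section.entries.items():
            # Keep the entry from the newest section that mentions this object.
            resolved.setdefault(obj_num, offset)
        pos = section.prev
    return resolved
```

With three sections that each define object 26, the offset from the section reached first (the newest one) is kept, and older entries are only used for objects the newer sections do not mention.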
As far as I can see, properly implementing support for incremental files requires more analysis and effort. I am open to proposals and PRs that attempt to tackle this, in several steps if necessary.