Reading incremental PDFs returns wrong (first) object per object number

Open • piyapan039285 opened this issue 6 months ago • 1 comment

I use pypdf to open a PDF file that contains attachments and save it to a new file. I found that the new file does not have the attachments.

Environment

$ python -m platform
Linux-6.14.9-orbstack-gd9e87d038362-aarch64-with-glibc2.36

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.6.0, crypt_provider=('cryptography', '42.0.8'), PIL=10.4.0

Code + PDF

This is a minimal, complete example that shows the issue:

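import io

from pypdf import PdfWriter

# Read the source PDF into memory.
with open("/opt/app/file.pdf", "rb") as f:
    pdf_base = f.read()

# Clone the whole document into a writer; the bytes are wrapped in a stream for pypdf.
writer = PdfWriter(clone_from=io.BytesIO(pdf_base))

# Serialize the cloned document to an in-memory buffer.
with io.BytesIO() as output_iostream:
    writer.write(output_iostream)
    pdf_bytes = output_iostream.getvalue()

# Save the result; the attachments are missing from this file.
with open("/opt/output.pdf", "wb") as f:
    f.write(pdf_bytes)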

This is my sample PDF file that has the issue: original.pdf

Problem

The output PDF does not contain the attachments.

Expected Result

The output PDF should contain the attachments.

Jun 05 '25 12:06 piyapan039285

Thanks for the report.

This is not related to the writer, but a general issue with the PDF file and how it is processed, where we are not even detecting the attachments:

from pypdf import PdfReader

reader = PdfReader("missing_attachment.pdf")
print(list(reader.attachment_list))

The catalog looks okay:

{'/Type': '/Catalog', '/Version': '/1.7', '/Pages': IndirectObject(2, 0, 140080531858448), '/Metadata': IndirectObject(21, 0, 140080531858448), '/Outlines': IndirectObject(22, 0, 140080531858448), '/Names': IndirectObject(26, 0, 140080531858448)}

The /Names entry for the embedded files points to object 26, which is defined three times inside the document because the PDF has been updated incrementally. We currently only read the first definition, which is a cross-reference stream:

26 0 obj
<<
/Length 76
/Root 1 0 R
/Info 24 0 R
/ID [<BE09E22BADBF38EEDC41DE6FC461296B> <BE09E22BADBF38EEDC41DE6FC461296B>]
/Type /XRef
/Size 27
/Index [0 26]
/W [1 2 1]
/Filter /FlateDecode
>>
...
endobj

We should read the second one instead:

26 0 obj
<<
/EmbeddedFiles 27 0 R
>>
endobj

Nevertheless, we have a third definition of the same object which now represents a content stream:

26 0 obj
<<
/Length 24006
/Filter /FlateDecode
/Length1 52696
>>
stream
...
endstream
endobj
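For anyone who wants to check their own file for such duplicates, here is a minimal diagnostic sketch. It is not part of pypdf; the helper name `find_object_definitions` and the sample bytes are made up for illustration. It simply scans the raw bytes for `N G obj` headers, so it can also report false positives inside binary stream data and cannot see objects stored in compressed object streams.

```python
import re


def find_object_definitions(data: bytes, obj_num: int, generation: int = 0) -> list:
    """Return the byte offset of every 'N G obj' header for the given object."""
    # The lookbehind rejects longer numbers, so "126 0 obj" is not counted for 26.
    pattern = re.compile(rb"(?<![0-9])%d\s+%d\s+obj\b" % (obj_num, generation))
    return [match.start() for match in pattern.finditer(data)]


# Hypothetical, heavily shortened stand-in for an incrementally updated file.
sample = (
    b"%PDF-1.7\n"
    b"26 0 obj\n<< /Type /XRef >>\nendobj\n"
    b"126 0 obj\n<< >>\nendobj\n"
    b"26 0 obj\n<< /EmbeddedFiles 27 0 R >>\nendobj\n"
    b"26 0 obj\n<< /Length 0 >>\nendobj\n"
)

offsets = find_object_definitions(sample, 26)
print(f"object 26 is defined {len(offsets)} times at offsets {offsets}")
```

For a real file, pass the result of `open(path, "rb").read()` as `data`.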

As far as I understand section 7.5.6 of the PDF 2.0 specification, if an object is defined more than once we should read the last definition, not the first. This would not solve your issue with the attachments, because we would then read the third version with the content stream, but it would at least satisfy the specification.
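To illustrate the "last definition wins" reading, here is a rough sketch, not pypdf's actual code. It assumes each revision of the file has already been parsed into a mapping from (object number, generation) to byte offset; the name `resolve_offsets` and all offsets are hypothetical.

```python
from typing import Dict, Iterable, Tuple

ObjectKey = Tuple[int, int]  # (object number, generation number)


def resolve_offsets(revisions: Iterable[Dict[ObjectKey, int]]) -> Dict[ObjectKey, int]:
    """Merge per-revision tables, given oldest first, so that later entries win."""
    resolved: Dict[ObjectKey, int] = {}
    for table in revisions:
        # dict.update overwrites existing keys, so a newer revision replaces
        # the offset recorded by an older one.
        resolved.update(table)
    return resolved


# Invented offsets for three revisions that all define object 26.
revisions = [
    {(1, 0): 15, (26, 0): 1200},     # original file
    {(26, 0): 5400, (27, 0): 5480},  # first incremental update
    {(26, 0): 9100},                 # second incremental update
]

resolved = resolve_offsets(revisions)
print(resolved[(26, 0)])
```

With these inputs, object 26 resolves to the offset from the newest revision, while objects 1 and 27 keep the only offsets they were ever given.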

Properly implementing support for incremental files probably requires more analysis and effort, as far as I can see. I am open to proposals and PRs that attempt to tackle this, in steps if necessary.

Jun 05 '25 13:06 stefan6419846