pypdf icon indicating copy to clipboard operation
pypdf copied to clipboard

BUG: Ambiguous translated references

Open biredel opened this issue 3 months ago • 7 comments

pypdf uses id(obj) to keep track of objects in _id_translated. This identifier is not unique for different objects, only for objects existing at the same time. e.g. In CPython id(obj), is just a memory address. This cannot be used to tell whether we already have referenced a page or not. It might point to a page of a since-deleted PdfReader, silently corrupting out output.

The bug manifests rarely, and the corrupted output document usually looks OK at first glance (but has missing/duplicated content), the following is my best reproducer so far, adapted from https://stackoverflow.com/questions/78186323/

# python3 repro.py < resources/multilang.pdf

import gc, io, sys
from pypdf import PdfReader, PdfWriter

def pdfsize(fin):
    pdfout = PdfWriter()
    pageout = pdfout.add_blank_page(width=10, height=10)
    reader = None
    for i in range(80):
        fin.seek(0)
        del reader
        gc.collect(0)
        reader = PdfReader(fin, strict=True)
        for pagein in reader.pages:
            pageout.merge_page(pagein)
    with io.BytesIO() as fout:
        pdfout.write(fout)
        return fout.tell()

if __name__ == "__main__":
    with io.BytesIO() as fin:
        fin.write(sys.stdin.buffer.read())
        for i in range(50):
            if i == 0:
                first_size = pdfsize(fin)
                first_size = pdfsize(fin)
            current_size = pdfsize(fin)
            if not first_size == current_size:
                raise RuntimeError(
                    f"generated PDF file size changed after repeating {i} times {first_size=} != {current_size=}"
                )

The attached draft replaces dict[id(obj)] with dict[weakref(obj)] - For the simple insert/lookup/delete cases, that behaves the same. Yet no-longer-existing objects are cleared automatically, removing the possible collision.

There is another instance of id(obj) outside of __repr__() methods in PageObject.hash_value_data - I have not determined whether that can cause related problems.

Essentially resubmitting https://github.com/py-pdf/pypdf/pull/1995 as a replacement for https://github.com/py-pdf/pypdf/pull/1841 which has memory usage side-effects and does not fully avoid https://github.com/py-pdf/pypdf/issues/1788 anyway.

biredel avatar Mar 29 '24 20:03 biredel