pypdf BUG: clean_forms function cause infinite looping if elt["/Resources"] have circular relation

I have several PDF files from customers, and using remove_text can cause infinite looping. Upon investigation, I discovered a corner case where elt["/Resources"] has a circular relation, which can result in calling clean_forms(content, stack + [elt]) infinitely.

I proposed keeping a memory variable visited_resources to keep track of which elt["/Resources"] has been processed and to avoid infinite looping.

The corner case files are private and cannot be shared, but I believe many people would encounter the same problem. Closes #2474

Feb 27 '24 11:02 syan-dev

Thanks for the PR. Are you able to provide a corresponding test file privately to @MartinThoma or strip one of the files down to further use it inside the tests and thus ensure that this keeps being fixed and coverage does not drop?

Feb 27 '24 11:02 stefan6419846

Hi @stefan6419846, these PDF files are private; therefore, I cannot share them with you. However, if I encounter any files experiencing similar issues, I will send them to you later. I hope you can incorporate this code so that I can simply install PyPDF with pip install. Thank you for providing this excellent source code.

Feb 27 '24 15:02 syan-dev

Apparently this "just" exceeds the recursion depth instead of blocking completely? At least this is what my experimental code to reproduce this issue indicates. (My code still breaks with this PR, but as it just exploits this with custom Python-based datastructures and no real PDF file, I do not consider this essential for merging.)

Traceback (most recent call last):
  File "/home/stefan/tmp/pypdf_upstream/pypdf/run.py", line 54, in <module>
    writer.remove_objects_from_page(page=page, to_delete=ObjectDeletionFlag.NONE)
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1836, in remove_objects_from_page
    images, forms = clean_forms(page, [], [])
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1811, in clean_forms
    clean_forms(content, stack + [elt], visited_objects)  # clean sub forms
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1811, in clean_forms
    clean_forms(content, stack + [elt], visited_objects)  # clean sub forms
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1811, in clean_forms
    clean_forms(content, stack + [elt], visited_objects)  # clean sub forms
  [Previous line repeated 984 more times]
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1803, in clean_forms
    content = ContentStream(o, self)
  File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/generic/_data_structures.py", line 941, in __init__
    if isinstance(stream, ArrayObject):
  File "/usr/lib/python3.10/typing.py", line 1503, in __instancecheck__
    issubclass(instance.__class__, cls)):
  File "/usr/lib/python3.10/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
RecursionError: maximum recursion depth exceeded in comparison

Unless @pubpub-zz states that your fix might have negative impacts, I plan on merging it tomorrow.

I still appreciate a corresponding unit/integration test nevertheless if you are able to provide one - even if its a separate PR later on.

Feb 27 '24 16:02 stefan6419846

If you look at the code, there is already a stack mecanism designed to prevent infinite loops. I therefore do not understand we need to 'duplicate' it with a new list. In order to have a better understanding, can you at least provide a screen capture of the left list of pdfbox in 'debug': pdfbox-app-3.0.0-alpha3.jar debug pdfbox can be found here: https://pdfbox.apache.org/2.0/commandline.html

this should help us to understand what is inside.

Currently I do not recommend merging

Feb 27 '24 17:02 pubpub-zz

Here is the image of looping, I try my best and can not understand why this happens but it takes forever to load only one page with this kind of looping

Feb 29 '24 08:02 syan-dev

I think I have found the root cause for this, I have printed to check if elt in stack with elt value and stack value. You can see that in the following output, elt is in stack but false to check elt in stack because we have included /Resources in the item of stack but not elt.

elt in stack: 
False
elt value:
{'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792]}
stack value:
[{'/Type': '/Page', '/Resources': {'/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC', '/ImageI'], '/XObject': {'/Im0': IndirectObject(5, 0, 140032982484208), '/Fm0': IndirectObject(6, 0, 140032982484208)}, '/Font': {'/F0': IndirectObject(8, 0, 140032982484208), '/F1': IndirectObject(9, 0, 140032982484208), '/F2': IndirectObject(10, 0, 140032982484208), '/F3': IndirectObject(36, 0, 140032982484208), '/F4': IndirectObject(41, 0, 140032982484208)}}, '/Contents': IndirectObject(46, 0, 140032982484208), '/MediaBox': [0, 0, 612, 792], '/CropBox': [0, 0, 612, 792], '/Rotate': 0, '/Parent': IndirectObject(1, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(7, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(12, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(14, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(16, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(18, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(20, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(22, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(24, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(26, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(34, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(28, 0, 140032982484208)}]

Therefore, my propose solution is to also include /Resources key to correct comparison:

if k1 not in ["/Length", "/Filter", "/DecodeParms", "/Resources"]

Feb 29 '24 08:02 syan-dev

Hi @pubpub-zz , can you help me review the screenshot from pdfbox.

Mar 01 '24 02:03 syan-dev

Review in progress

Mar 01 '24 10:03 pubpub-zz

a) first I'm confused with your output : the infinite is within your PDF and is not normal. b) I've made a small change and I've added some logs : can you run the program on your document replacing clean_forms with this:

        def clean_forms(
            elt: DictionaryObject, stack: List[DictionaryObject]
        ) -> Tuple[List[str], List[str]]:
            nonlocal to_delete
            try:
                print("elt : ",elt.indirect_reference)
            except Exception:
                print("elt  ---")
            print("stack ind: ", [s.indirect_reference for s in stack])
            print("stack native: ", stack)
            print("in? ",elt in stack)
            if elt in stack:
                # to prevent infinite looping
                return [], []  # pragma: no cover
            try:
                d = cast(
                    Dict[Any, Any],
                    cast(DictionaryObject, elt["/Resources"])["/XObject"],
                )
            except KeyError:
                d = {}
            images = []
            forms = []
            for k, v in d.items():
                print("k",k)
                o = v.get_object()
                try:
                    content: Any = None
                    if (
                        to_delete & ObjectDeletionFlag.XOBJECT_IMAGES
                        and o["/Subtype"] == "/Image"
                    ):
                        content = NullObject()  # to delete the image keeping the entry
                        images.append(k)
                        i
                    if o["/Subtype"] == "/Form":
                        forms.append(k)
                        if isinstance(o, ContentStream):
                            content = o
                        else:
                            content = ContentStream(o, self)
                            content.update(
                                {
                                    k1: v1
                                    for k1, v1 in o.items()
                                    if k1 not in ["/Length", "/Filter", "/DecodeParms"]
                                }
                            )
                        stack.append(elt)
                        print("call")
                        clean_forms(content, stack)  # clean sub forms
                    if content is not None:
                        if isinstance(v, IndirectObject):
                            self._objects[v.idnum - 1] = content
                        else:
                            # should only occur with pdf not respecting pdf spec
                            # where streams must be indirected.
                            d[k] = self._add_object(content)  # pragma: no cover
                except (TypeError, KeyError):
                    pass
            for im in images:
                del d[im]  # for clean-up
            if isinstance(elt, StreamObject):  # for /Form
                if not isinstance(elt, ContentStream):  # pragma: no cover
                    e = ContentStream(elt, self)
                    e.update(elt.items())
                    elt = e
                clean(elt, images, forms)  # clean the content
            return images, forms

        if not isinstance(page, PageObject):
            page = PageObject(self, page.indirect_reference)  # pragma: no cover
        if "/Contents" in page:
            content = cast(ContentStream, page.get_contents())

            images, forms = clean_forms(page, [])

            clean(content, images, forms)
            page.replace_contents(content)

Mar 01 '24 10:03 pubpub-zz

Hi @pubpub-zz , here is the log from your code. I have to edit it a little to avoid print ContentStream error

print("stack ind: ", [s.indirect_reference for s in stack])
print("stack ind: ", [s.indirect_reference for s in stack if not isinstance(s, ContentStream)])

log.txt

Mar 05 '24 04:03 syan-dev

a) I have some problem to understand your traces : your first line "elt : {'/Type': ..." should indicate an indirect_object can you please provide your full test code to see how you are building your PdfWriter (please provide/use a minimal code)

b) the "elt ---" indicates that the object are not indirect_objects which means that they are newly generated that I do not understand but explains the infinite loop can you please rerun with the following test code (I've seen your note and took that into account however not tested)

        def clean_forms(
            elt: DictionaryObject, stack: List[DictionaryObject]
        ) -> Tuple[List[str], List[str]]:
            nonlocal to_delete
            try:
                print("elt : ",type(elt.indirect_reference), elt.indirect_reference)
            except Exception:
                print("elt  ---",e.hash_value())
            print("stack ind: ", [(s.indirect_reference if "indirect_reference" in s.__dir__() else "---" )for s in stack ])
            print("stack native: ", stack)
            print("in? ",elt in stack)
            if elt in stack:
                # to prevent infinite looping
                return [], []  # pragma: no cover
            try:
                d = cast(
                    Dict[Any, Any],
                    cast(DictionaryObject, elt["/Resources"])["/XObject"],
                )
            except KeyError:
                d = {}
            images = []
            forms = []
            for k, v in d.items():
                print("k",k)
                o = v.get_object()
                try:
                    content: Any = None
                    if (
                        to_delete & ObjectDeletionFlag.XOBJECT_IMAGES
                        and o["/Subtype"] == "/Image"
                    ):
                        content = NullObject()  # to delete the image keeping the entry
                        images.append(k)
                    if o["/Subtype"] == "/Form":
                        forms.append(k)
                        print("hash(",k,"): ",o.hash_value()," in ",d.hash_value())
                        print("value :",type(v),v)
                        if isinstance(o, ContentStream):
                            content = o
                        else:
                            content = ContentStream(o, self)
                            content.update(
                                {
                                    k1: v1
                                    for k1, v1 in o.items()
                                    if k1 not in ["/Length", "/Filter", "/DecodeParms"]
                                }
                            )
                        stack.append(elt)
                        print("call")
                        clean_forms(content, stack)  # clean sub forms
                    if content is not None:
                        if isinstance(v, IndirectObject):
                            self._objects[v.idnum - 1] = content
                        else:
                            # should only occur with pdf not respecting pdf spec
                            # where streams must be indirected.
                            d[k] = self._add_object(content)  # pragma: no cover
                except (TypeError, KeyError):
                    pass
            for im in images:
                del d[im]  # for clean-up
            if isinstance(elt, StreamObject):  # for /Form
                if not isinstance(elt, ContentStream):  # pragma: no cover
                    e = ContentStream(elt, self)
                    e.update(elt.items())
                    elt = e
                clean(elt, images, forms)  # clean the content
            return images, forms

        if not isinstance(page, PageObject):
            page = PageObject(self, page.indirect_reference)  # pragma: no cover
        if "/Contents" in page:
            content = cast(ContentStream, page.get_contents())

            images, forms = clean_forms(page, [])

            clean(content, images, forms)
            page.replace_contents(content)

Mar 05 '24 05:03 pubpub-zz

Hi @pubpub-zz

a, Here is the test code I am using:

import io
from pypdf import PdfReader, PdfWriter

pdf_content = open('file_name.pdf', 'rb').read()

pdf_writer = PdfWriter()
pdf_reader = PdfReader(io.BytesIO(pdf_content))

pdf_writer.add_page(pdf_reader.pages[0])
pdf_writer.remove_text()

b, Here is the log I got with the new code you provided

log2.txt

Mar 05 '24 06:03 syan-dev

I start to understand. can you retry with this code (may fix the issue):

        def clean_forms(
            elt: DictionaryObject, stack: List[DictionaryObject]
        ) -> Tuple[List[str], List[str]]:
            nonlocal to_delete
            try:
                print("elt : ",type(elt.indirect_reference), elt.indirect_reference.__repr__())
            except Exception:
                print("elt  ---",e.hash_value())
            print("stack ind: ", [(s.indirect_reference if "indirect_reference" in s.__dir__() else "---" )for s in stack ])
            print("stack native: ", stack)
            print("in? ",elt in stack)
            if hasattr(elt, "indirect_reference"):
                if any(( elt.indirect_reference == getattr(x,"indirect_reference",-1) for x in stack)):
                    return [], []
            if elt in stack:
                # to prevent infinite looping
                return [], []  # pragma: no cover
            try:
                d = cast(
                    Dict[Any, Any],
                    cast(DictionaryObject, elt["/Resources"])["/XObject"],
                )
            except KeyError:
                d = {}
            images = []
            forms = []
            for k, v in d.items():
                print("k",k)
                o = v.get_object()
                try:
                    content: Any = None
                    if (
                        to_delete & ObjectDeletionFlag.XOBJECT_IMAGES
                        and o["/Subtype"] == "/Image"
                    ):
                        content = NullObject()  # to delete the image keeping the entry
                        images.append(k)
                    if o["/Subtype"] == "/Form":
                        forms.append(k)
                        print("hash(",k,"): ",o.hash_value()," in ",d.hash_value())
                        print("value :",type(v),v)
                        if isinstance(o, ContentStream):
                            content = o
                        else:
                            content = ContentStream(o, self)
                            content.update(
                                {
                                    k1: v1
                                    for k1, v1 in o.items()
                                    if k1 not in ["/Length", "/Filter", "/DecodeParms"]
                                }
                            )
                            try:
                                content.indirect_reference = o.indirect_reference
                            except AttributeError:
                                pass
                        stack.append(elt)
                        print("call")
                        clean_forms(content, stack)  # clean sub forms
                    if content is not None:
                        if isinstance(v, IndirectObject):
                            self._objects[v.idnum - 1] = content
                        else:
                            # should only occur with pdf not respecting pdf spec
                            # where streams must be indirected.
                            d[k] = self._add_object(content)  # pragma: no cover
                except (TypeError, KeyError):
                    pass
            for im in images:
                del d[im]  # for clean-up
            if isinstance(elt, StreamObject):  # for /Form
                if not isinstance(elt, ContentStream):  # pragma: no cover
                    e = ContentStream(elt, self)
                    e.update(elt.items())
                    elt = e
                clean(elt, images, forms)  # clean the content
            return images, forms

        if not isinstance(page, PageObject):
            page = PageObject(self, page.indirect_reference)  # pragma: no cover
        if "/Contents" in page:
            content = cast(ContentStream, page.get_contents())

            images, forms = clean_forms(page, [])

            clean(content, images, forms)
            page.replace_contents(content)

Mar 05 '24 06:03 pubpub-zz

Thank you very much. Somehow, using stack.append(elt) instead of stack + [elt] fixed the problem. Here is the log for your updated code. log3.txt

Mar 05 '24 07:03 syan-dev

Thank you very much. Somehow, using stack.append(elt) instead of stack + [elt] fixed the problem. Here is the log for your updated code. log3.txt

Can you confirm that it is only this last proposal that fixes the issue ?

Mar 06 '24 18:03 pubpub-zz

@syanng what would be great would be to have a test file. removing text and replacing all image with some dumb image(looping through images in a page and replacing it with a test image), could you try to produce a test file ?

Mar 06 '24 21:03 pubpub-zz

Yes, the issue has been fixed. Thank you very much. For the test file, could you please guide me on how to replace text content or image content with dumb text and dumb images?

Mar 07 '24 02:03 syan-dev

For the test file, could you please guide me on how to replace text content or image content with dumb text and dumb images?

You might want to have a look at the remove_* methods of PdfWriter as in the docs: https://pypdf.readthedocs.io/en/latest/modules/PdfWriter.html#pypdf.PdfWriter.remove_links

Mar 07 '24 08:03 stefan6419846

fixed in #2505

Mar 16 '24 19:03 pubpub-zz