
Circular references in XObjects cause infinite recursion crash

Open dhdaines opened this issue 8 months ago • 11 comments

Here is an evil PDF for you, which contains a circular reference loop in its Form XObjects:

evil_xobjects.pdf

pdfminer.six (for example, pdf2txt.py evil_xobjects.pdf) will crash on it with a RecursionError.
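Until a fix lands, a caller that processes untrusted PDFs might contain the crash on its own side. The following is only a minimal sketch: `extract_or_none` is a hypothetical helper, not part of pdfminer.six, and it takes the extraction function as a parameter (for instance `pdfminer.high_level.extract_text`) so the wrapper itself has no dependency on the library.

```python
from typing import Callable, Optional


def extract_or_none(extract: Callable[[str], str], path: str) -> Optional[str]:
    """Run an extraction callable and return None if it recurses without end.

    `extract` is any function that takes a path and returns text; this helper
    is a hypothetical wrapper, not a pdfminer.six API.
    """
    try:
        return extract(path)
    except RecursionError:
        # The document most likely contains a reference loop; skip it.
        return None
```

This hides the symptom rather than fixing the loop, and the interpreter still burns through the whole recursion limit before giving up.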

The location of this error may vary with the phase of the moon, but the problem is just in PDFPageInterpreter.do_Do. The simplest way to fix it is to keep track of the set of parent XObjects in each PDFPageInterpreter and refuse to invoke a Form XObject that is in this set.

dhdaines avatar Apr 10 '25 18:04 dhdaines

Note that other renderers do more ... interesting things with this. PDFium for instance seems to allow up to a certain number of recursions and then stops, while Ghostscript detects the reference right away and stops.
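The two strategies can be contrasted on a toy model. This is a sketch under stated assumptions, not how PDFium or Ghostscript are actually implemented: `FORMS`, `render_depth_limited`, and `render_cycle_checked` are hypothetical names, and each "form" is reduced to the list of forms it invokes.

```python
from typing import Dict, List, Optional, Set

# Hypothetical toy model: each form name maps to the forms it invokes.
FORMS: Dict[str, List[str]] = {
    "A": ["B"],
    "B": ["C"],
    "C": ["A"],  # closes the loop A -> B -> C -> A
}


def render_depth_limited(
    name: str, max_depth: int, depth: int = 0, trace: Optional[List[str]] = None
) -> List[str]:
    """Allow re-entry, but stop once max_depth nested invocations are reached."""
    trace = [] if trace is None else trace
    if depth >= max_depth:
        return trace
    trace.append(name)
    for child in FORMS.get(name, []):
        render_depth_limited(child, max_depth, depth + 1, trace)
    return trace


def render_cycle_checked(
    name: str, ancestors: Optional[Set[str]] = None, trace: Optional[List[str]] = None
) -> List[str]:
    """Refuse to invoke a form that is already one of its own ancestors."""
    ancestors = set() if ancestors is None else ancestors
    trace = [] if trace is None else trace
    if name in ancestors:
        return trace
    trace.append(name)
    ancestors.add(name)
    try:
        for child in FORMS.get(name, []):
            render_cycle_checked(child, ancestors, trace)
    finally:
        # Only ancestors count: drop the name again once this form is done.
        ancestors.discard(name)
    return trace
```

With a limit of 5, the depth-limited version walks A, B, C, A, B before stopping, while the cycle-checked version stops after A, B, C.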

pdf.js currently seems to loop forever (unless it's been fixed!), going to go file a bug :)

dhdaines avatar Apr 10 '25 19:04 dhdaines

I am having the same issue with a different PDF:

    text = extract_text(pdf_path, page_numbers=[page_num])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\high_level.py", line 184, in extract_text
    interpreter.process_page(page)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\pdfinterp.py", line 1211, in process_page
    self.device.end_page(page)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\converter.py", line 89, in end_page
    self.cur_item.analyze(self.laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 934, in analyze
    group.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)

############ Repeats for many many more lines, but I cut those off. ############

RecursionError: maximum recursion depth exceeded

cmyers009 avatar Jun 24 '25 17:06 cmyers009

The issue appears to be caused by the do_Do function in pdfinterp.py not keeping track of the XObjects it has already visited.

This appears to fix the issue:

# ------------------------------------------------------------------
# MODIFIED
# ------------------------------------------------------------------

def do_Do(self, xobjid_arg: PDFStackT) -> None:
    """Invoke named XObject, skipping circular Form-XObject references."""
    xobjid = literal_name(xobjid_arg)

    # ------------------------------------------------------------------
    # Circular-reference guard
    # ------------------------------------------------------------------
    if not hasattr(self, "_processing_xobjs"):
        # Lazily create the shared set the first time we enter do_Do.
        self._processing_xobjs: set[str] = set()

    if xobjid in self._processing_xobjs:
        # We have already begun processing this XObject somewhere higher
        # in the call stack -> ignore the recursive call.
        log.warning("Skipping XObject %r due to circular reference", xobjid)
        return

    # Mark this XObject as "in progress" for the duration of this call.
    self._processing_xobjs.add(xobjid)
    try:
        # ------------------------------------------------------------------
        # Original logic (unchanged except for one tiny tweak noted below)
        # ------------------------------------------------------------------
        try:
            xobj = stream_value(self.xobjmap[xobjid])
        except KeyError:
            if settings.STRICT:
                raise PDFInterpreterError("Undefined xobject id: %r" % xobjid)
            return

        log.debug("Processing xobj: %r", xobj)
        subtype = xobj.get("Subtype")

        if subtype is LITERAL_FORM and "BBox" in xobj:
            # Child interpreter must share the same circular-reference set.
            interpreter = self.dup()
            interpreter._processing_xobjs = self._processing_xobjs  #NEW#

            bbox = cast(Rect, list_value(xobj["BBox"]))
            matrix = cast(Matrix, list_value(xobj.get("Matrix", MATRIX_IDENTITY)))

            # PDFs < 1.2 fall back to the page resources.
            xobjres = xobj.get("Resources")
            resources = dict_value(xobjres) if xobjres else self.resources.copy()

            self.device.begin_figure(xobjid, bbox, matrix)
            interpreter.render_contents(
                resources,
                [xobj],
                ctm=mult_matrix(matrix, self.ctm),
            )
            self.device.end_figure(xobjid)

        elif subtype is LITERAL_IMAGE and "Width" in xobj and "Height" in xobj:
            self.device.begin_figure(xobjid, (0, 0, 1, 1), MATRIX_IDENTITY)
            self.device.render_image(xobjid, xobj)
            self.device.end_figure(xobjid)

        else:
            # Unsupported XObject type.
            pass
    finally:
        # Always pop, even if render_contents raised.
        self._processing_xobjs.remove(xobjid)

cmyers009 avatar Jun 24 '25 18:06 cmyers009

@pietermarsman Can you make a PR for this change?

cmyers009 avatar Jun 24 '25 18:06 cmyers009

@dhdaines Can you make a PR for this change?

cmyers009 avatar Jun 24 '25 19:06 cmyers009

> @dhdaines Can you make a PR for this change?

Yes, the fix is pretty simple. I think I can supply one later today.

dhdaines avatar Jun 25 '25 14:06 dhdaines

Ah, I see you've already supplied a possible fix.

I think it would be more elegant to pass in a set of parent XObjects when creating an interpreter rather than poking around in private attributes.
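A minimal sketch of that idea, assuming a toy interpreter rather than pdfminer's real `PDFPageInterpreter`: `ToyInterpreter` and its `parents` parameter are hypothetical, and each XObject is reduced to the list of names it invokes.

```python
from typing import Dict, FrozenSet, List


class ToyInterpreter:
    """Hypothetical stand-in for an interpreter; not pdfminer's real class."""

    def __init__(
        self,
        xobjmap: Dict[str, List[str]],
        parents: FrozenSet[str] = frozenset(),
    ) -> None:
        self.xobjmap = xobjmap
        # Names of the XObjects that are currently executing above us.
        self.parents = parents

    def do_Do(self, name: str) -> List[str]:
        """Return the names rendered, skipping any XObject that is a parent."""
        if name in self.parents:
            return []
        rendered = [name]
        # The child receives its ancestry through the constructor.
        child = ToyInterpreter(self.xobjmap, self.parents | {name})
        for inner in self.xobjmap.get(name, []):
            rendered.extend(child.do_Do(inner))
        return rendered
```

Because each child gets its own immutable set, nothing has to be removed afterwards, and an XObject used by two siblings is still rendered twice, since it is not its own ancestor.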

dhdaines avatar Jun 25 '25 14:06 dhdaines

> Ah, I see you've already supplied a possible fix.
>
> I think it would be more elegant to pass in a set of parent XObjects when creating an interpreter rather than poking around in private attributes.

Sure, you can make that change if you would like. You know the library better than I do.

cmyers009 avatar Jun 25 '25 14:06 cmyers009

OK! I supplied what may be a more general fix. Another evil thing that one might do is to treat any old content stream (such as a page, or a Type3 font program) as a Form XObject, or vice versa, so the check is done as close as possible to the actual execution of content streams.
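One way to picture a check placed at the point of execution is a guard keyed on the stream object itself rather than on its resource name. This is a sketch under stated assumptions, not the code from the actual fix: `ContentStream`, `StreamGuard`, and `execute` are hypothetical, and Python's `id()` stands in for whatever object identity the real code would use.

```python
from contextlib import contextmanager
from typing import Iterator, List, Set


class ContentStream:
    """Hypothetical content stream: a label plus the streams it executes."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.children: List["ContentStream"] = []


class StreamGuard:
    """Tracks which streams are currently executing, by object identity."""

    def __init__(self) -> None:
        self._active: Set[int] = set()

    @contextmanager
    def enter(self, stream: ContentStream) -> Iterator[bool]:
        key = id(stream)
        if key in self._active:
            yield False  # already executing further up the stack
            return
        self._active.add(key)
        try:
            yield True
        finally:
            self._active.discard(key)


def execute(stream: ContentStream, guard: StreamGuard, out: List[str]) -> None:
    """Single entry point for pages, forms, and glyph programs alike."""
    with guard.enter(stream) as allowed:
        if not allowed:
            return
        out.append(stream.label)
        for child in stream.children:
            execute(child, guard, out)
```

Since every kind of content stream goes through the same `execute` function, a page that is reused as a form (or a glyph program that invokes its own page) is caught by the same check, and two distinct streams that happen to share a name are not confused.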

dhdaines avatar Jun 25 '25 15:06 dhdaines

@pietermarsman I believe this should actually also be labeled type:security.

It falls under the weakness category CWE-835, Loop with Unreachable Exit Condition ('Infinite Loop'): https://cwe.mitre.org/data/definitions/835.html

cmyers009 avatar Nov 10 '25 16:11 cmyers009