Circular references in XObjects cause infinite recursion crash
Here is an evil PDF for you, which contains a circular reference loop in its Form XObjects:
pdfminer.six (for example, pdf2txt.py evil_xobjects.pdf) will crash on it with a RecursionError.
The location of this error may vary with the phase of the moon, but the problem is just in PDFPageInterpreter.do_Do. The simplest way to fix it is to keep track of the set of parent XObjects in each PDFPageInterpreter and refuse to invoke a Form XObject that is in this set.
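To make the idea concrete, here is a minimal toy sketch of the "set of parent XObjects" guard. It does not use pdfminer.six at all: `FormXObject`, `ToyInterpreter`, and the `log` list are hypothetical names for illustration, and XObjects are keyed by name in a single global map, which is a simplification of how resources really work.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FormXObject:
    """Toy stand-in for a Form XObject (hypothetical, not a pdfminer class)."""

    name: str
    # Names of the XObjects that this form's content stream invokes with "Do".
    invokes: list[str] = field(default_factory=list)


class ToyInterpreter:
    """Toy interpreter that remembers which XObjects are its ancestors."""

    def __init__(self, xobjmap, parents=frozenset(), log=None):
        self.xobjmap = xobjmap
        self.parents = parents  # XObjects currently being rendered above us
        self.log = [] if log is None else log  # shared record of what ran

    def do_Do(self, name: str) -> None:
        if name in self.parents:
            # Invoking an ancestor would recurse forever, so refuse.
            return
        self.log.append(name)
        # The child interpreter knows about all of our parents plus this one.
        child = ToyInterpreter(self.xobjmap, self.parents | {name}, self.log)
        for inner in self.xobjmap[name].invokes:
            child.do_Do(inner)


def render(xobjmap, start: str) -> list[str]:
    """Render `start` and return the order in which XObjects were executed."""
    interp = ToyInterpreter(xobjmap)
    interp.do_Do(start)
    return interp.log


# A -> B -> A is a cycle: the second visit to A is refused.
cycle = {"A": FormXObject("A", ["B"]), "B": FormXObject("B", ["A"])}
# A -> B -> D and A -> C -> D is not a cycle: D is legitimately drawn twice.
diamond = {
    "A": FormXObject("A", ["B", "C"]),
    "B": FormXObject("B", ["D"]),
    "C": FormXObject("C", ["D"]),
    "D": FormXObject("D"),
}
```

Because the set holds only ancestors (not everything ever visited), an XObject that is reused in several places still gets drawn each time; only true cycles are cut.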
Note that other renderers do more ... interesting things with this. PDFium for instance seems to allow up to a certain number of recursions and then stops, while Ghostscript detects the reference right away and stops.
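For comparison, the "allow a certain number of recursions and then stop" strategy could look roughly like the toy sketch below. The function name, the `invokes` mapping, and the limit value are all assumptions for illustration; I have not checked what limit PDFium actually uses or how it implements it.

```python
from __future__ import annotations


def render_with_depth_limit(
    name: str,
    invokes: dict[str, list[str]],
    max_depth: int = 4,  # arbitrary illustrative limit, not PDFium's
    _depth: int = 0,
    _out: list[str] | None = None,
) -> list[str]:
    """Execute XObjects recursively but give up below `max_depth` levels."""
    out = [] if _out is None else _out
    if _depth >= max_depth:
        # Too deep: stop silently instead of recursing further.
        return out
    out.append(name)
    for inner in invokes.get(name, []):
        render_with_depth_limit(inner, invokes, max_depth, _depth + 1, out)
    return out
```

Unlike cycle detection, this renders a self-referencing XObject several times before stopping, which may explain output that differs between renderers.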
pdf.js currently seems to loop forever (unless it's been fixed!), going to go file a bug :)
I am having the same issue with a different pdf.
    text = extract_text(pdf_path, page_numbers=[page_num])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\high_level.py", line 184, in extract_text
    interpreter.process_page(page)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\pdfinterp.py", line 1211, in process_page
    self.device.end_page(page)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\converter.py", line 89, in end_page
    self.cur_item.analyze(self.laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 934, in analyze
    group.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 674, in analyze
    super().analyze(laparams)
  File "C:\Users{USER}\AppData\Local\Programs\Python\Python311\Lib\site-packages\pdfminer\layout.py", line 435, in analyze
    obj.analyze(laparams)
############ Repeats for many many more lines, but I cut those off. ############
RecursionError: maximum recursion depth exceeded
The issue appears to be that the do_Do function in pdfinterp.py does not keep track of the XObjects it has already started processing.
This appears to fix the issue:
# ------------------------------------------------------------------
#MODIFIED#
# ------------------------------------------------------------------
def do_Do(self, xobjid_arg: PDFStackT) -> None:
    """Invoke named XObject, skipping circular Form-XObject references."""
    xobjid = literal_name(xobjid_arg)

    # ------------------------------------------------------------------
    # Circular-reference guard
    # ------------------------------------------------------------------
    if not hasattr(self, "_processing_xobjs"):
        # Lazily create the shared set the first time we enter do_Do.
        self._processing_xobjs: set[str] = set()

    if xobjid in self._processing_xobjs:
        # We have already begun processing this XObject somewhere higher
        # in the call stack -> ignore the recursive call.
        log.warning("Skipping XObject %r due to circular reference", xobjid)
        return

    # Mark this XObject as "in progress" for the duration of this call.
    self._processing_xobjs.add(xobjid)
    try:
        # ------------------------------------------------------------------
        # Original logic (unchanged except for one tiny tweak noted below)
        # ------------------------------------------------------------------
        try:
            xobj = stream_value(self.xobjmap[xobjid])
        except KeyError:
            if settings.STRICT:
                raise PDFInterpreterError("Undefined xobject id: %r" % xobjid)
            return

        log.debug("Processing xobj: %r", xobj)
        subtype = xobj.get("Subtype")

        if subtype is LITERAL_FORM and "BBox" in xobj:
            # Child interpreter must share the same circular-reference set.
            interpreter = self.dup()
            interpreter._processing_xobjs = self._processing_xobjs  #NEW#

            bbox = cast(Rect, list_value(xobj["BBox"]))
            matrix = cast(Matrix, list_value(xobj.get("Matrix", MATRIX_IDENTITY)))

            # PDFs < 1.2 fall back to the page resources.
            xobjres = xobj.get("Resources")
            resources = dict_value(xobjres) if xobjres else self.resources.copy()

            self.device.begin_figure(xobjid, bbox, matrix)
            interpreter.render_contents(
                resources,
                [xobj],
                ctm=mult_matrix(matrix, self.ctm),
            )
            self.device.end_figure(xobjid)

        elif subtype is LITERAL_IMAGE and "Width" in xobj and "Height" in xobj:
            self.device.begin_figure(xobjid, (0, 0, 1, 1), MATRIX_IDENTITY)
            self.device.render_image(xobjid, xobj)
            self.device.end_figure(xobjid)
        else:
            # Unsupported XObject type.
            pass
    finally:
        # Always pop, even if render_contents raised.
        self._processing_xobjs.remove(xobjid)
@pietermarsman Can you make a PR for this change?
@dhdaines Can you make a PR for this change?
Here are some example test pdfs which replicate the issue.
evil3_three_cycle_text.pdf evil1_self_ref_text.pdf evil2_mutual_ref_text.pdf evil5_four_cycle_text.pdf evil4_deep_self_nested_text.pdf
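For anyone who cannot download the attachments, a file of the self-referencing kind can be generated with something like the sketch below. This is my own minimal construction and not necessarily how the attached files were built: the function name, the `/Fm0` resource name, and the object layout are all assumptions. It writes a one-page PDF whose page invokes a Form XObject that in turn invokes itself.

```python
def build_self_referencing_pdf() -> bytes:
    """Return a tiny PDF whose Form XObject /Fm0 invokes itself."""
    content = b"/Fm0 Do\n"  # content stream: just invoke /Fm0
    xobject_res = b"<< /XObject << /Fm0 5 0 R >> >>"
    stream = b"/Length %d >>\nstream\n%sendstream" % (len(content), content)

    # Object bodies, numbered 1..5 in order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200]"
        b" /Resources " + xobject_res + b" /Contents 4 0 R >>",
        # Page content stream: invokes the form.
        b"<< " + stream,
        # The form's own resources map /Fm0 back to object 5 (itself).
        b"<< /Type /XObject /Subtype /Form /BBox [0 0 200 200]"
        b" /Resources " + xobject_res + b" " + stream,
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # mandatory free entry for object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1,
        xref_pos,
    )
    return bytes(out)


pdf_bytes = build_self_referencing_pdf()
```

Writing `pdf_bytes` to a file (any name will do) should give an input for a regression test of the guard; I would still check it against the attached files before relying on it.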
Yes the fix is pretty simple, I can supply one later today I think.
Ah, I see you've already supplied a possible fix.
I think it would be more elegant to pass in a set of parent XObjects when creating an interpreter rather than poking around in private attributes.
Sure, you can make that change if you would like. You know the library better than me.
Ok! I supplied maybe a more general fix. Another evil thing that one might do is to treat any old content stream (such as a page, or a Type3 font program) as a Form XObject, or vice versa, so the check is done as close as possible to the actual execution of content streams.
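A rough illustration of what checking "close to the execution of content streams" can mean, as opposed to checking in do_Do by resource name. This is a toy model and not the actual patch: `ContentStream`, `ToyInterpreter`, and the use of an integer `objid` as the stream's identity are assumptions made for the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ContentStream:
    """Toy content stream, identified by its indirect object number."""

    objid: int
    # Resource names that this stream invokes with the "Do" operator.
    do_names: tuple[str, ...] = ()


class ToyInterpreter:
    def __init__(self, resources: dict[str, ContentStream]):
        self.resources = resources
        self.executed: list[int] = []
        self.skipped: list[int] = []

    def render_contents(self, stream: ContentStream, active=frozenset()) -> None:
        # The guard sits where streams are executed, so it applies to pages,
        # forms, or anything else that ends up being run as a content stream.
        if stream.objid in active:
            self.skipped.append(stream.objid)
            return
        self.executed.append(stream.objid)
        inner_active = active | {stream.objid}
        for name in stream.do_names:
            self.render_contents(self.resources[name], inner_active)


# The page's own content stream (object 1) is also exposed as an "XObject"
# under a different name, so a name-based check in do_Do would not see it.
page = ContentStream(1, ("Fm0",))
form = ContentStream(5, ("PageAsForm",))
interp = ToyInterpreter({"Fm0": form, "PageAsForm": page})
interp.render_contents(page)
```

Keying on the stream itself rather than on the name also avoids false positives when two resource dictionaries reuse the same name for different objects.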
@pietermarsman I believe this should actually also be type:security
It falls under this weakness category, CWE-835: Loop with Unreachable Exit Condition ('Infinite Loop'): https://cwe.mitre.org/data/definitions/835.html