pypdf
pypdf copied to clipboard
BUG: clean_forms function cause infinite looping if elt["/Resources"] have circular relation
I have several PDF files from customers, and using remove_text
can cause infinite looping. Upon investigation, I discovered a corner case where elt["/Resources"]
has a circular relation, which can result in calling clean_forms(content, stack + [elt])
infinitely.
I proposed keeping a memory variable visited_resources
to keep track of which elt["/Resources"]
has been processed and to avoid infinite looping.
The corner case files are private and cannot be shared, but I believe many people would encounter the same problem. Closes #2474
Thanks for the PR. Are you able to provide a corresponding test file privately to @MartinThoma or strip one of the files down to further use it inside the tests and thus ensure that this keeps being fixed and coverage does not drop?
Hi @stefan6419846, these PDF files are private; therefore, I cannot share them with you. However, if I encounter any files experiencing similar issues, I will send them to you later. I hope you can incorporate this code so that I can simply install PyPDF with pip install
. Thank you for providing this excellent source code.
Apparently this "just" exceeds the recursion depth instead of blocking completely? At least this is what my experimental code to reproduce this issue indicates. (My code still breaks with this PR, but as it just exploits this with custom Python-based datastructures and no real PDF file, I do not consider this essential for merging.)
Traceback (most recent call last):
File "/home/stefan/tmp/pypdf_upstream/pypdf/run.py", line 54, in <module>
writer.remove_objects_from_page(page=page, to_delete=ObjectDeletionFlag.NONE)
File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1836, in remove_objects_from_page
images, forms = clean_forms(page, [], [])
File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1811, in clean_forms
clean_forms(content, stack + [elt], visited_objects) # clean sub forms
File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1811, in clean_forms
clean_forms(content, stack + [elt], visited_objects) # clean sub forms
File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1811, in clean_forms
clean_forms(content, stack + [elt], visited_objects) # clean sub forms
[Previous line repeated 984 more times]
File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/_writer.py", line 1803, in clean_forms
content = ContentStream(o, self)
File "/home/stefan/tmp/pypdf_upstream/pypdf/pypdf/generic/_data_structures.py", line 941, in __init__
if isinstance(stream, ArrayObject):
File "/usr/lib/python3.10/typing.py", line 1503, in __instancecheck__
issubclass(instance.__class__, cls)):
File "/usr/lib/python3.10/abc.py", line 123, in __subclasscheck__
return _abc_subclasscheck(cls, subclass)
RecursionError: maximum recursion depth exceeded in comparison
Unless @pubpub-zz states that your fix might have negative impacts, I plan on merging it tomorrow.
I still appreciate a corresponding unit/integration test nevertheless if you are able to provide one - even if its a separate PR later on.
If you look at the code, there is already a stack mecanism designed to prevent infinite loops. I therefore do not understand we need to 'duplicate' it with a new list.
In order to have a better understanding, can you at least provide a screen capture of the left list of pdfbox in 'debug':
pdfbox-app-3.0.0-alpha3.jar debug
pdfbox can be found here: https://pdfbox.apache.org/2.0/commandline.html
this should help us to understand what is inside.
Currently I do not recommend merging
Here is the image of looping, I try my best and can not understand why this happens but it takes forever to load only one page with this kind of looping
I think I have found the root cause for this, I have printed to check if elt in stack
with elt
value and stack
value.
You can see that in the following output, elt
is in stack
but false to check elt in stack
because we have included /Resources
in the item of stack
but not elt
.
elt in stack:
False
elt value:
{'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792]}
stack value:
[{'/Type': '/Page', '/Resources': {'/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC', '/ImageI'], '/XObject': {'/Im0': IndirectObject(5, 0, 140032982484208), '/Fm0': IndirectObject(6, 0, 140032982484208)}, '/Font': {'/F0': IndirectObject(8, 0, 140032982484208), '/F1': IndirectObject(9, 0, 140032982484208), '/F2': IndirectObject(10, 0, 140032982484208), '/F3': IndirectObject(36, 0, 140032982484208), '/F4': IndirectObject(41, 0, 140032982484208)}}, '/Contents': IndirectObject(46, 0, 140032982484208), '/MediaBox': [0, 0, 612, 792], '/CropBox': [0, 0, 612, 792], '/Rotate': 0, '/Parent': IndirectObject(1, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(7, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(12, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(14, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(16, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(18, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(20, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(22, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(24, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(26, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(34, 0, 140032982484208)}, {'/Type': '/XObject', '/Subtype': '/Form', '/FormType': 1, '/BBox': [0, 0, 612, 792], '/Resources': IndirectObject(28, 0, 140032982484208)}]
Therefore, my propose solution is to also include /Resources
key to correct comparison:
if k1 not in ["/Length", "/Filter", "/DecodeParms", "/Resources"]
Hi @pubpub-zz , can you help me review the screenshot from pdfbox.
Review in progress
a) first I'm confused with your output : the infinite is within your PDF and is not normal. b) I've made a small change and I've added some logs : can you run the program on your document replacing clean_forms with this:
def clean_forms(
elt: DictionaryObject, stack: List[DictionaryObject]
) -> Tuple[List[str], List[str]]:
nonlocal to_delete
try:
print("elt : ",elt.indirect_reference)
except Exception:
print("elt ---")
print("stack ind: ", [s.indirect_reference for s in stack])
print("stack native: ", stack)
print("in? ",elt in stack)
if elt in stack:
# to prevent infinite looping
return [], [] # pragma: no cover
try:
d = cast(
Dict[Any, Any],
cast(DictionaryObject, elt["/Resources"])["/XObject"],
)
except KeyError:
d = {}
images = []
forms = []
for k, v in d.items():
print("k",k)
o = v.get_object()
try:
content: Any = None
if (
to_delete & ObjectDeletionFlag.XOBJECT_IMAGES
and o["/Subtype"] == "/Image"
):
content = NullObject() # to delete the image keeping the entry
images.append(k)
i
if o["/Subtype"] == "/Form":
forms.append(k)
if isinstance(o, ContentStream):
content = o
else:
content = ContentStream(o, self)
content.update(
{
k1: v1
for k1, v1 in o.items()
if k1 not in ["/Length", "/Filter", "/DecodeParms"]
}
)
stack.append(elt)
print("call")
clean_forms(content, stack) # clean sub forms
if content is not None:
if isinstance(v, IndirectObject):
self._objects[v.idnum - 1] = content
else:
# should only occur with pdf not respecting pdf spec
# where streams must be indirected.
d[k] = self._add_object(content) # pragma: no cover
except (TypeError, KeyError):
pass
for im in images:
del d[im] # for clean-up
if isinstance(elt, StreamObject): # for /Form
if not isinstance(elt, ContentStream): # pragma: no cover
e = ContentStream(elt, self)
e.update(elt.items())
elt = e
clean(elt, images, forms) # clean the content
return images, forms
if not isinstance(page, PageObject):
page = PageObject(self, page.indirect_reference) # pragma: no cover
if "/Contents" in page:
content = cast(ContentStream, page.get_contents())
images, forms = clean_forms(page, [])
clean(content, images, forms)
page.replace_contents(content)
Hi @pubpub-zz , here is the log from your code. I have to edit it a little to avoid print ContentStream error
print("stack ind: ", [s.indirect_reference for s in stack])
print("stack ind: ", [s.indirect_reference for s in stack if not isinstance(s, ContentStream)])
a) I have some problem to understand your traces : your first line "elt : {'/Type': ..." should indicate an indirect_object can you please provide your full test code to see how you are building your PdfWriter (please provide/use a minimal code)
b) the "elt ---" indicates that the object are not indirect_objects which means that they are newly generated that I do not understand but explains the infinite loop can you please rerun with the following test code (I've seen your note and took that into account however not tested)
def clean_forms(
elt: DictionaryObject, stack: List[DictionaryObject]
) -> Tuple[List[str], List[str]]:
nonlocal to_delete
try:
print("elt : ",type(elt.indirect_reference), elt.indirect_reference)
except Exception:
print("elt ---",e.hash_value())
print("stack ind: ", [(s.indirect_reference if "indirect_reference" in s.__dir__() else "---" )for s in stack ])
print("stack native: ", stack)
print("in? ",elt in stack)
if elt in stack:
# to prevent infinite looping
return [], [] # pragma: no cover
try:
d = cast(
Dict[Any, Any],
cast(DictionaryObject, elt["/Resources"])["/XObject"],
)
except KeyError:
d = {}
images = []
forms = []
for k, v in d.items():
print("k",k)
o = v.get_object()
try:
content: Any = None
if (
to_delete & ObjectDeletionFlag.XOBJECT_IMAGES
and o["/Subtype"] == "/Image"
):
content = NullObject() # to delete the image keeping the entry
images.append(k)
if o["/Subtype"] == "/Form":
forms.append(k)
print("hash(",k,"): ",o.hash_value()," in ",d.hash_value())
print("value :",type(v),v)
if isinstance(o, ContentStream):
content = o
else:
content = ContentStream(o, self)
content.update(
{
k1: v1
for k1, v1 in o.items()
if k1 not in ["/Length", "/Filter", "/DecodeParms"]
}
)
stack.append(elt)
print("call")
clean_forms(content, stack) # clean sub forms
if content is not None:
if isinstance(v, IndirectObject):
self._objects[v.idnum - 1] = content
else:
# should only occur with pdf not respecting pdf spec
# where streams must be indirected.
d[k] = self._add_object(content) # pragma: no cover
except (TypeError, KeyError):
pass
for im in images:
del d[im] # for clean-up
if isinstance(elt, StreamObject): # for /Form
if not isinstance(elt, ContentStream): # pragma: no cover
e = ContentStream(elt, self)
e.update(elt.items())
elt = e
clean(elt, images, forms) # clean the content
return images, forms
if not isinstance(page, PageObject):
page = PageObject(self, page.indirect_reference) # pragma: no cover
if "/Contents" in page:
content = cast(ContentStream, page.get_contents())
images, forms = clean_forms(page, [])
clean(content, images, forms)
page.replace_contents(content)
Hi @pubpub-zz
a, Here is the test code I am using:
import io
from pypdf import PdfReader, PdfWriter
pdf_content = open('file_name.pdf', 'rb').read()
pdf_writer = PdfWriter()
pdf_reader = PdfReader(io.BytesIO(pdf_content))
pdf_writer.add_page(pdf_reader.pages[0])
pdf_writer.remove_text()
b, Here is the log I got with the new code you provided
I start to understand. can you retry with this code (may fix the issue):
def clean_forms(
elt: DictionaryObject, stack: List[DictionaryObject]
) -> Tuple[List[str], List[str]]:
nonlocal to_delete
try:
print("elt : ",type(elt.indirect_reference), elt.indirect_reference.__repr__())
except Exception:
print("elt ---",e.hash_value())
print("stack ind: ", [(s.indirect_reference if "indirect_reference" in s.__dir__() else "---" )for s in stack ])
print("stack native: ", stack)
print("in? ",elt in stack)
if hasattr(elt, "indirect_reference"):
if any(( elt.indirect_reference == getattr(x,"indirect_reference",-1) for x in stack)):
return [], []
if elt in stack:
# to prevent infinite looping
return [], [] # pragma: no cover
try:
d = cast(
Dict[Any, Any],
cast(DictionaryObject, elt["/Resources"])["/XObject"],
)
except KeyError:
d = {}
images = []
forms = []
for k, v in d.items():
print("k",k)
o = v.get_object()
try:
content: Any = None
if (
to_delete & ObjectDeletionFlag.XOBJECT_IMAGES
and o["/Subtype"] == "/Image"
):
content = NullObject() # to delete the image keeping the entry
images.append(k)
if o["/Subtype"] == "/Form":
forms.append(k)
print("hash(",k,"): ",o.hash_value()," in ",d.hash_value())
print("value :",type(v),v)
if isinstance(o, ContentStream):
content = o
else:
content = ContentStream(o, self)
content.update(
{
k1: v1
for k1, v1 in o.items()
if k1 not in ["/Length", "/Filter", "/DecodeParms"]
}
)
try:
content.indirect_reference = o.indirect_reference
except AttributeError:
pass
stack.append(elt)
print("call")
clean_forms(content, stack) # clean sub forms
if content is not None:
if isinstance(v, IndirectObject):
self._objects[v.idnum - 1] = content
else:
# should only occur with pdf not respecting pdf spec
# where streams must be indirected.
d[k] = self._add_object(content) # pragma: no cover
except (TypeError, KeyError):
pass
for im in images:
del d[im] # for clean-up
if isinstance(elt, StreamObject): # for /Form
if not isinstance(elt, ContentStream): # pragma: no cover
e = ContentStream(elt, self)
e.update(elt.items())
elt = e
clean(elt, images, forms) # clean the content
return images, forms
if not isinstance(page, PageObject):
page = PageObject(self, page.indirect_reference) # pragma: no cover
if "/Contents" in page:
content = cast(ContentStream, page.get_contents())
images, forms = clean_forms(page, [])
clean(content, images, forms)
page.replace_contents(content)
Thank you very much. Somehow, using stack.append(elt)
instead of stack + [elt]
fixed the problem. Here is the log for your updated code.
log3.txt
Thank you very much. Somehow, using
stack.append(elt)
instead ofstack + [elt]
fixed the problem. Here is the log for your updated code. log3.txt
Can you confirm that it is only this last proposal that fixes the issue ?
@syanng what would be great would be to have a test file. removing text and replacing all image with some dumb image(looping through images in a page and replacing it with a test image), could you try to produce a test file ?
Yes, the issue has been fixed. Thank you very much. For the test file, could you please guide me on how to replace text content or image content with dumb text and dumb images?
For the test file, could you please guide me on how to replace text content or image content with dumb text and dumb images?
You might want to have a look at the remove_*
methods of PdfWriter
as in the docs: https://pypdf.readthedocs.io/en/latest/modules/PdfWriter.html#pypdf.PdfWriter.remove_links
fixed in #2505