PyMuPDF
PyMuPDF==1.24.0 hangs when using page.get_text("text")
Description of the bug
OS: Linux (Ubuntu 22.04 LTS), Python 3.10.2
When I upload a PDF file, the program hangs for several hours without exiting while calling the get_text() method.
How to reproduce the bug
>>> import fitz as pymupdf
>>> pdf_path = '/data/dataset/book/pdf/bad/e4a0626f933941c6db3257f5cea4f3e5.pdf'
>>> def parse_test(pdf_path) -> str:
...     pymu_doc = pymupdf.open(pdf_path, filetype="pdf")
...     contents = []
...     try:
...         if not pymu_doc:
...             return contents
...         for _, page in enumerate(pymu_doc):
...             content = page.get_text("text")
...             contents.append(content.replace('\n', ' '))
...     except Exception:
...         contents = []
...     return '\n'.join(contents)
...
>>> a = parse_test(pdf_path)
e4a0626f933941c6db3257f5cea4f3e5.pdf
PyMuPDF version
1.24.0
Operating system
Linux
Python version
3.10
This is a base library problem occurring on the first page; the second page works. I will open a bug in MuPDF's issue system.
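Since the hang is confined to a single page, one possible stopgap until the base library is fixed is to extract each page in a separate process with a timeout, so that one bad page cannot block the whole job. The following is only a minimal sketch and not part of PyMuPDF: the helper names `run_with_timeout` and `page_text_or_none` and the 30-second default are assumptions made for illustration. The only PyMuPDF calls used are the ones already shown in the report (`fitz.open`, page indexing, `page.get_text("text")`).

```python
import subprocess
import sys

# Script executed in a child interpreter: extract one page and write its
# text to stdout as UTF-8. Arguments: PDF path, zero-based page number.
_CHILD_SCRIPT = """
import sys
import fitz  # PyMuPDF
path, page_number = sys.argv[1], int(sys.argv[2])
text = fitz.open(path)[page_number].get_text("text")
sys.stdout.buffer.write(text.encode("utf-8"))
"""


def run_with_timeout(code, args, timeout):
    """Run `code` in a fresh Python process (hypothetical helper).

    Returns the child's stdout as text, or None if the child timed out
    or exited with a non-zero status.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code, *args],
            capture_output=True,
            timeout=timeout,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.decode("utf-8", errors="replace")


def page_text_or_none(pdf_path, page_number, timeout=30.0):
    """Text of one page, or None if extraction hangs or fails (assumed 30 s limit)."""
    return run_with_timeout(_CHILD_SCRIPT, [pdf_path, str(page_number)], timeout)
```

With this in place, a call such as `page_text_or_none(pdf_path, 0)` should give up after the timeout instead of blocking indefinitely. Reopening the document for every page is slow, so this is meant as a temporary guard rather than a normal extraction path.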
When providing code snippets, please use properly indented code blocks using slash commands
Here is the reference to the MuPDF bug: https://bugs.ghostscript.com/show_bug.cgi?id=707721
I encountered the same on Friday with 1.24.1. Rolling back to 1.23.14 "fixed" it.
Interestingly enough, it was triggered on a .pdf with the same first page as @xiaominghero shared (attached).
(❗ note that we encountered this on different files, not just this one)
import fitz
doc = fitz.open("I_break_things.pdf")
doc
Interestingly enough, it was triggered on a .pdf with the same first page as @xiaominghero shared (attached).
It is this one page that causes the problem - not any other.
@jan-benisek Unfortunately, version 1.23.14 is not OK for me.
This has been fixed in v1.24.2.
Hi @JorjMcKie,
I still run into the same issue on 1.24.2 while testing on the file (the first page) that @jan-benisek submitted.
That is,
import fitz
fitz.open("I_break_things.pdf")[0].get_text()
hangs, while it works fine on 1.23.14.
Sorry - my bad. The required / developed MuPDF fix is not contained in this version yet. I am re-opening the issue.
No worries, and thanks a lot for all your work.
Fixed in 1.24.3.
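For anyone who wants to detect an affected install at startup, here is a small sketch of a version guard. The range is taken only from the reports in this thread (hang seen on 1.24.0, 1.24.1 and 1.24.2; 1.23.14 works; fixed in 1.24.3), so treat the bounds as an assumption rather than an official statement. The function names are made up for illustration, and the installed version is read through the standard library's `importlib.metadata` rather than any PyMuPDF API.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Bounds inferred from this thread only (assumption, not an official range).
FIRST_AFFECTED = (1, 24, 0)
FIRST_FIXED = (1, 24, 3)


def parse_version(text):
    """Turn a string such as '1.24.2' into the tuple (1, 24, 2)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(group) for group in match.groups())


def is_affected(version_text):
    """True if the version falls in the range reported as hanging."""
    return FIRST_AFFECTED <= parse_version(version_text) < FIRST_FIXED


def installed_pymupdf_is_affected():
    """Check the installed PyMuPDF distribution; None if it is not installed."""
    try:
        return is_affected(version("PyMuPDF"))
    except PackageNotFoundError:
        return None
```

A caller could log a warning or refuse to start when `installed_pymupdf_is_affected()` returns `True`, and upgrade to 1.24.3 or later.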