
PyMuPDF==1.24.0 hangs when calling page.get_text("text")

Open xiaominghero opened this issue 10 months ago • 10 comments

Description of the bug

OS: Linux (Ubuntu 22.04 LTS), Python 3.10.2

When I upload a PDF file, the program hangs for several hours without exiting while calling the get_text() method.


How to reproduce the bug

>>> import fitz as pymupdf
>>> pdf_path = '/data/dataset/book/pdf/bad/e4a0626f933941c6db3257f5cea4f3e5.pdf'
>>> def parse_test(pdf_path) -> str:
...     pymu_doc = pymupdf.open(pdf_path, filetype="pdf")
...     contents = []
...     try:
...         if not pymu_doc:  # empty document
...             return ''  # keep the return type consistent with the annotation
...         for page in pymu_doc:
...             content = page.get_text("text")  # the call that never returns
...             contents.append(content.replace('\n', ' '))
...     except Exception:
...         contents = []
...     return '\n'.join(contents)
...
>>> a = parse_test(pdf_path)

e4a0626f933941c6db3257f5cea4f3e5.pdf
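
Because the call never returns, a try/except around it cannot help. One possible workaround, sketched here as an assumption and not something proposed in this thread, is to run the extraction in a separate interpreter and kill it after a timeout. The names `EXTRACT_SCRIPT`, `run_script_with_timeout` and `get_page_text` are hypothetical helpers, and the 30-second default is an arbitrary choice.

```python
import subprocess
import sys

# Hypothetical child script: opens the PDF given as argv[1] and writes the
# text of page argv[2] to stdout as UTF-8. Requires PyMuPDF in the child.
EXTRACT_SCRIPT = """
import sys
import fitz  # PyMuPDF

doc = fitz.open(sys.argv[1])
text = doc[int(sys.argv[2])].get_text("text")
sys.stdout.buffer.write(text.encode("utf-8"))
"""


def run_script_with_timeout(script, args, timeout_s):
    """Run `script` in a fresh interpreter; return its stdout, or None on timeout."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", script, *args],
            capture_output=True,
            timeout=timeout_s,
            check=True,  # raise CalledProcessError if the child exits non-zero
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return None
    return completed.stdout.decode("utf-8")


def get_page_text(pdf_path, page_number, timeout_s=30.0):
    """Extract one page's text, giving up (None) if extraction appears to hang."""
    return run_script_with_timeout(
        EXTRACT_SCRIPT, [pdf_path, str(page_number)], timeout_s
    )
```

Calling `get_page_text(pdf_path, 0)` would then return `None` for a page that hangs instead of blocking the caller for hours. Starting an interpreter per page is slow, so this is better suited to quarantining problem files than to bulk extraction.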

PyMuPDF version

1.24.0

Operating system

Linux

Python version

3.10

xiaominghero avatar Apr 07 '24 09:04 xiaominghero

This is a problem in the base library (MuPDF). It occurs on the first page; the second page works. I will open a bug in MuPDF's issue system.

JorjMcKie avatar Apr 07 '24 10:04 JorjMcKie

When providing code snippets, please use properly indented code blocks using slash commands.

JorjMcKie avatar Apr 07 '24 10:04 JorjMcKie

Here is the reference to the MuPDF bug: https://bugs.ghostscript.com/show_bug.cgi?id=707721

JorjMcKie avatar Apr 07 '24 10:04 JorjMcKie

I encountered the same issue on Friday with 1.24.1. Rolling back to 1.23.14 "fixed" it. Interestingly enough, it was triggered on a .pdf with the same first page as the one @xiaominghero shared (attached). (❗ Note that we encountered this on different files, not just this one.)

import fitz
doc = fitz.open("I_break_things.pdf")
doc

I_break_things.pdf

jan-benisek avatar Apr 08 '24 07:04 jan-benisek

Interestingly enough, it was triggered on a .pdf with the same first page as @xiaominghero shared (attached).

It is this one page that causes the problem - not any other.

JorjMcKie avatar Apr 08 '24 08:04 JorjMcKie

@jan-benisek Unfortunately, Version 1.23.14 is not OK for me.

xiaominghero avatar Apr 09 '24 16:04 xiaominghero

This has been fixed in v1.24.2.

JorjMcKie avatar Apr 18 '24 14:04 JorjMcKie

Hi @JorjMcKie ,

I still run into the same issues on 1.24.2 while testing on the file (the first page) that @jan-benisek submitted.

That is,

import fitz

fitz.open("I_break_things.pdf")[0].get_text()

hangs, while working fine on 1.23.14.

henriklaurentz avatar Apr 19 '24 08:04 henriklaurentz

Sorry - my bad. The MuPDF fix developed for this is not contained in this version yet. I am re-opening the issue.

JorjMcKie avatar Apr 19 '24 10:04 JorjMcKie

No worries, and thanks a lot for all your work.

henriklaurentz avatar Apr 19 '24 12:04 henriklaurentz

Fixed in 1.24.3.
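
Going only by the reports in this thread (hangs seen on 1.24.0, 1.24.1 and 1.24.2, fixed in 1.24.3), a small guard can flag an installed version that falls in that range. This is a minimal sketch: the helper names are hypothetical, the version bounds are taken from the comments above and not from an official changelog, and the parser assumes a plain dotted numeric version string.

```python
from importlib.metadata import PackageNotFoundError, version

# Bounds taken from the reports in this thread (assumption, not a changelog).
FIRST_AFFECTED = (1, 24, 0)
FIRST_FIXED = (1, 24, 3)


def parse_version(version_string):
    """Turn '1.24.2' into (1, 24, 2); assumes purely numeric dotted parts."""
    return tuple(int(part) for part in version_string.split("."))


def is_reported_affected(version_string):
    """True if the version lies in the range reported as hanging in this thread."""
    return FIRST_AFFECTED <= parse_version(version_string) < FIRST_FIXED


def installed_pymupdf_version():
    """Return the installed PyMuPDF version string, or None if it is not installed."""
    try:
        return version("PyMuPDF")
    except PackageNotFoundError:
        return None
```

A caller could check `installed_pymupdf_version()` at startup and, if `is_reported_affected(...)` is true, warn or upgrade before processing untrusted PDFs.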