python-pdfbox
text extraction hangs on MacOS 10.14
I am trying to use pdfbox, with this vanilla snippet:
import pdfbox

# pdf and txt are assumed to be pathlib.Path objects
converter = pdfbox.PDFBox()
converter.extract_text(
    input_path=str(pdf.absolute()),
    output_path=str(txt.absolute()))
But it becomes stuck. I debugged the stack trace, and it hangs at this line:

I confirmed that a Java process is spawned:
➜ jps
5416 Jps
5385
329 <-- spawned process
But it is just stuck there.
Running the jar cached by python-pdfbox in the terminal works:
java -jar pdfbox-app-2.0.17.jar ExtractText '/Users/devcsrj/Projects/devcsrj/klerk/dist/17/SENATE/regular-1/journal-28.pdf' '/Users/devcsrj/Projects/devcsrj/klerk/dist/17/SENATE/regular-1/journal-28.txt'
So I am no longer sure what's going on. Thoughts?
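Since the jar itself runs fine from the terminal, one possible stopgap is to call it out of process from Python instead of going through the in-process bridge. The sketch below is only an illustration: the helper names `build_extract_command` and `extract_text_via_cli` are made up for this example and are not part of python-pdfbox, and it assumes `java` is on the PATH and that the jar path is passed in by the caller.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar, pdf, txt):
    """Build the same ExtractText command line that works in the terminal."""
    return ["java", "-jar", str(jar), "ExtractText", str(pdf), str(txt)]


def extract_text_via_cli(jar, pdf, txt, timeout=120):
    """Run PDFBox as a separate process (hypothetical helper).

    Raises subprocess.TimeoutExpired if the process does not finish within
    `timeout` seconds, so a hang turns into an exception instead of blocking.
    """
    cmd = build_extract_command(jar, pdf, txt)
    # check=True turns a non-zero exit code into CalledProcessError.
    subprocess.run(cmd, check=True, timeout=timeout)
    return Path(txt)
```

This does not explain why the in-process call hangs; it only sidesteps it, and the timeout value is arbitrary.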
Environment
Python
python-pdfbox = "==0.1.7"
python_version = "3.7"
Java
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-20190711112007.graal.jdk8u-src-tar-gz-b08)
OpenJDK 64-Bit GraalVM CE 19.2.0 (build 25.222-b08-jvmci-19.2-b02, mixed mode)
OS
macOS Mojave 10.14.4
I have the same issue. Did you find a solution to this?
@adarsa Not really no. I ended up abandoning pdfbox altogether, and used tesseract to extract text instead.
Does this occur with all PDFs, or only with some? If the latter, can you attach it to this issue?
@lebedov I haven't had the chance to try it on other PDFs, but as for the file I am using in the screenshot, it is this one.
I can't reproduce the hanging problem with the input PDF file you mentioned on Ubuntu Linux 18.0.4 with Python 3.7.3 and OpenJDK 11.0.4. I suspect some sort of platform-specific jpype weirdness, but I unfortunately don't have a MacOS box to debug this. I'll leave the issue open for the time being in case anyone who can investigate further has input.
I had this issue with all PDFs I tried.
+1
+1
I finally obtained access to a MacOS box. I can't reproduce the problem with Python 3.8.5, OpenJDK 14.0.2, and python-pdfbox 0.1.8 on MacOS 10.15.6; processing the indicated file succeeds without any error.
I had the same issue, also on macOS Mojave, with this Java JDK version:
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
I installed OpenJDK 15 from here and that fixed the issue.
@peterHeuz Given that more than one person has encountered the issue on MacOS, I added a note to the package README.
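For anyone landing here, a quick way to see which Java the system will pick up is to parse the output of `java -version`. The following is a rough sketch, not something python-pdfbox provides: `parse_java_major` and `installed_java_major` are hypothetical names, and the parsing only covers the `version "..."` formats quoted in this thread.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Return the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier report "1.x" (e.g. "1.8.0_222"); later releases
    # report the major version first (e.g. "15.0.1").
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def installed_java_major():
    """Query the `java` found on PATH; the version banner goes to stderr."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)
```

In this thread the hang was only reported with 1.8 builds on macOS Mojave, while OpenJDK 11, 14, and 15 worked, so any cutoff built on this check would be a guess rather than a confirmed requirement. `installed_java_major` raises `FileNotFoundError` if no `java` executable is on the PATH.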