python-pdfbox

text extraction hangs on MacOS 10.14

Open devcsrj opened this issue 4 years ago • 11 comments

I am trying to use pdfbox, with this vanilla snippet:

import pdfbox

# pdf and txt are pathlib.Path objects for the input PDF and the output text file
converter = pdfbox.PDFBox()
converter.extract_text(
    input_path=str(pdf.absolute()),
    output_path=str(txt.absolute()))

But the call never returns. I inspected the stack trace, and it hangs at this line:

[Screenshot: Screen Shot 2019-10-02 at 6 25 24 AM]

I confirmed that a Java process is spawned:

➜ jps
5416 Jps
5385
329    <-- spawned process

But it is just stuck there.

Running the cached jar by python-pdfbox in the terminal works:

java -jar pdfbox-app-2.0.17.jar ExtractText '/Users/devcsrj/Projects/devcsrj/klerk/dist/17/SENATE/regular-1/journal-28.pdf' '/Users/devcsrj/Projects/devcsrj/klerk/dist/17/SENATE/regular-1/journal-28.txt'

So I am no longer sure what's going on. Thoughts?
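
Since the jar works when invoked directly, one possible workaround (not confirmed by anyone in this thread) is to call it from Python through a subprocess with a timeout, so a stuck JVM cannot block the caller. This is a minimal sketch, assuming `java` is on the PATH and the location of the cached jar is known; the function names `build_extract_command` and `extract_text_via_cli` are hypothetical and not part of python-pdfbox.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar_path, pdf_path, txt_path):
    """Return the argv list for the ExtractText command shown above."""
    return [
        "java", "-jar", str(jar_path),
        "ExtractText", str(pdf_path), str(txt_path),
    ]


def extract_text_via_cli(jar_path, pdf_path, txt_path, timeout=120):
    """Run ExtractText in a child process and return the output path.

    Raises subprocess.TimeoutExpired if the JVM runs longer than
    `timeout` seconds, and subprocess.CalledProcessError if it exits
    with a non-zero status.
    """
    cmd = build_extract_command(jar_path, pdf_path, txt_path)
    # On timeout, subprocess.run kills the child before raising.
    subprocess.run(cmd, check=True, timeout=timeout)
    return Path(txt_path)
```

This bypasses the in-process JVM bridge entirely, at the cost of starting a new JVM for every file.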


Environment

Python

python-pdfbox = "==0.1.7"
python_version = "3.7"

Java

openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-20190711112007.graal.jdk8u-src-tar-gz-b08)
OpenJDK 64-Bit GraalVM CE 19.2.0 (build 25.222-b08-jvmci-19.2-b02, mixed mode)

OS

macOS Mojave 10.14.4

devcsrj avatar Oct 01 '19 22:10 devcsrj

I have the same issue. Did you find a solution to this?

adarsa avatar Nov 12 '19 03:11 adarsa

@adarsa Not really, no. I ended up abandoning pdfbox altogether and used Tesseract to extract the text instead.

devcsrj avatar Nov 12 '19 06:11 devcsrj

Does this occur with all PDFs, or only with some? If the latter, can you attach it to this issue?

lebedov avatar Nov 12 '19 13:11 lebedov

@lebedov I haven't had the chance to try it on other PDFs, but as for the file I am using in the screenshot, it is this one.

devcsrj avatar Nov 12 '19 21:11 devcsrj

I can't reproduce the hang with the input PDF file you mentioned on Ubuntu Linux 18.0.4 with Python 3.7.3 and OpenJDK 11.0.4. I suspect some sort of platform-specific jpype weirdness, but unfortunately I don't have a MacOS box to debug this on. I'll leave the issue open for the time being in case anyone able to investigate has further input.

lebedov avatar Nov 13 '19 13:11 lebedov

I had this issue with all the PDFs I tried.

adarsa avatar Nov 25 '19 10:11 adarsa

+1

suiyuan2009 avatar Apr 13 '20 08:04 suiyuan2009

+1

sprakash93 avatar Jun 02 '20 22:06 sprakash93

I finally obtained access to a MacOS box. I can't reproduce the problem with Python 3.8.5, OpenJDK 14.0.2, and python-pdfbox 0.1.8 on MacOS 10.15.6; processing the indicated file succeeds without any error.

lebedov avatar Aug 05 '20 02:08 lebedov

I had the same issue, also on macOS Mojave, with this Java version:

java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

I installed OpenJDK 15 from here, and that fixed the issue.
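
Both hang reports in this thread involve a Java 1.8 runtime, while the working setups use OpenJDK 11, 14, or 15, so it may help to check which major version `java` resolves to before calling pdfbox. The sketch below is illustrative only: the helper names are hypothetical, and the thread does not establish exactly which versions are affected.

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output.

    Handles the legacy scheme ("1.8.0_151" gives 8) and the modern
    scheme ("15.0.1" gives 15). Returns None if no version is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy version strings start with "1." followed by the major version.
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major():
    """Return the major version of the `java` found on the PATH."""
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True
    )
    return parse_java_major(result.stderr)
```

A caller could, for example, log a warning when `installed_java_major()` returns 8 on macOS, which matches the configurations reported here.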

peterHeuz avatar Jan 12 '21 13:01 peterHeuz

@peterHeuz Given that more than one person has encountered the issue on MacOS, I added a note to the package README.

lebedov avatar Jan 12 '21 16:01 lebedov