python-pdfbox extract_text goes on forever.

extract_text goes on forever.

Open Rammurthy5 opened this issue 4 years ago • 3 comments

I installed the latest python-pdfbox on my Mac via pip, imported it, and called the extract_text() method. The call runs indefinitely on a 196 KB file. Please help.

>>> import pdfbox as p, os
>>> os.path.exists(f)  # f is the file path
True
>>> pp = p.PDFBox()
>>> pp.extract_text(f)

extract_text(f) never returns; it runs indefinitely.

Aug 04 '20 06:08 Rammurthy5

What versions of Python, Java, and macOS are you running? Can you attach the file you are trying to process? As noted in #14, I haven't been able to reproduce the problem.

Aug 05 '20 02:08 lebedov
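For anyone hitting the same symptom, the version details requested above can be gathered in one go. The following is a minimal sketch using only the Python standard library; the function name environment_report is made up for illustration and is not part of python-pdfbox.

```python
import platform
import shutil
import subprocess


def environment_report():
    """Collect Python, OS, and Java version details as a dict of strings.

    Hypothetical helper for bug reports; not part of python-pdfbox.
    """
    report = {
        "python": platform.python_version(),
        "os": platform.platform(),
        # mac_ver() returns an empty release string on non-macOS systems.
        "macos": platform.mac_ver()[0] or "n/a (not macOS)",
    }
    java = shutil.which("java")
    if java is None:
        report["java"] = "not found on PATH"
        return report
    try:
        # `java -version` writes its banner to stderr rather than stdout.
        result = subprocess.run(
            [java, "-version"], capture_output=True, text=True, timeout=30
        )
        lines = (result.stderr or result.stdout).strip().splitlines()
        report["java"] = lines[0] if lines else "unknown"
    except (OSError, subprocess.TimeoutExpired) as exc:
        report["java"] = f"could not run java: {exc}"
    return report


if __name__ == "__main__":
    for name, value in environment_report().items():
        print(f"{name}: {value}")
```

Pasting the printed lines into the issue covers the Python, Java, and macOS versions in a single reply.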

macOS: 10.15.6
Python: 3.7.1
Java: 1.8.0_202
File attached: pdf copy.pdf

Aug 05 '20 04:08 Rammurthy5

I didn't encounter any errors with the file you posted using the package versions in #14. Can you try using OpenJDK 14 rather than Oracle's Java?

Aug 06 '20 18:08 lebedov
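While trying different Java versions, it can help to bound the call so a hang shows up as a timeout instead of a stuck interpreter. Below is a rough sketch that runs the same call in a child Python process with a time limit. The helper name run_with_timeout and the EXTRACT_SNIPPET string are illustrative assumptions; the snippet only repeats the PDFBox().extract_text(f) call from the session in this issue.

```python
import subprocess
import sys

# Assumed usage, copied from the session in this issue: the PDF path is
# passed as the first command-line argument of the child interpreter.
EXTRACT_SNIPPET = "import sys, pdfbox; pdfbox.PDFBox().extract_text(sys.argv[1])"


def run_with_timeout(snippet, args=(), timeout=60.0):
    """Run a Python snippet in a child interpreter.

    Returns (finished, returncode, stderr). finished is False if the
    child was still running when the timeout expired.
    """
    cmd = [sys.executable, "-c", snippet, *args]
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising.
        return False, None, ""
    return True, done.returncode, done.stderr


if __name__ == "__main__" and len(sys.argv) > 1:
    finished, code, err = run_with_timeout(EXTRACT_SNIPPET, [sys.argv[1]], timeout=60)
    print("finished" if finished else "timed out", code, err)
```

If the child times out, only the child interpreter itself is killed; any Java process it started may need to be ended separately. A non-zero return code together with the captured stderr points to an error rather than a hang.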
