python-pdfbox
Extracting order pre-definable?
Hi Guys,
Just wondering whether the text extraction order can be defined for a PDF file. As pointed out here, is there a similar setting to adjust the extraction order?
This image shows the error.
AUB_Financials_Dec_2018_pg9.pdf
Any insights would be much appreciated.
Thanks. Luke
Does the sort option of the extract_text method do what you need? If not, you will have to look into wrapping pdfbox's dev API (by design, python-pdfbox only exposes pdfbox's command line interface); I have posted a gist that demonstrates how to access the API from Python that you can use as a starting point for wrapping the PDFTextStripper Java class so that you can run the setSortByPosition() method.
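For reference, here is a minimal sketch of what the sort option amounts to at the command-line level, assuming a PDFBox 2.x app jar, whose ExtractText tool accepts a -sort flag. The helper names and the jar and file paths are hypothetical, and PDFBox 3.x renamed its subcommands, so the argument list would need adjusting there.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar_path, pdf_path, txt_path, sort=True):
    """Build the argument list for PDFBox 2.x's ExtractText command-line tool."""
    cmd = ["java", "-jar", str(jar_path), "ExtractText"]
    if sort:
        # -sort asks PDFBox to order the text by position before writing it out
        cmd.append("-sort")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text_sorted(jar_path, pdf_path, txt_path):
    """Run the command and return the extracted text.

    Assumes Java is on PATH and jar_path points at a PDFBox 2.x app jar.
    """
    subprocess.run(build_extract_command(jar_path, pdf_path, txt_path), check=True)
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")
```

Calling extract_text_sorted("pdfbox-app.jar", "AUB_Financials_Dec_2018_pg9.pdf", "page9.txt") would then write and return the position-sorted text. With python-pdfbox itself, the equivalent is presumably just passing the sort option to extract_text.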
@zevio, if you delete the pdfbox-app*jar file cached by python-pdfbox (in ~/.cache/python-pdfbox on Linux or ~/Library/Caches/python-pdfbox on macOS), the latest jar file will be downloaded the next time you import the package.
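If it helps, the cache cleanup could be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, and it covers just the two cache locations mentioned above (Linux and macOS).

```python
import sys
from pathlib import Path


def default_cache_dir():
    """Return the python-pdfbox cache location quoted above (Linux/macOS only)."""
    if sys.platform == "darwin":
        return Path.home() / "Library" / "Caches" / "python-pdfbox"
    return Path.home() / ".cache" / "python-pdfbox"


def remove_cached_jars(cache_dir):
    """Delete pdfbox-app*jar files in cache_dir and return the removed names."""
    cache_dir = Path(cache_dir)
    removed = []
    if cache_dir.is_dir():
        for jar in sorted(cache_dir.glob("pdfbox-app*jar")):
            jar.unlink()
            removed.append(jar.name)
    return removed
```

Running remove_cached_jars(default_cache_dir()) before the next import of the package should then trigger a fresh download of the jar.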
I was about to correct my suggestion. Actually, I think the issue is not directly linked to the jar file version but to the -sort option, as you said previously. The same issue currently occurs with Apache Tika, which bundles PDFBox. However, neither calling setSortByPosition() nor changing the Apache Tika configuration file seems to work on my end. Still, using the -sort option with the jar file fixes most of my issues. Surprisingly, though, I obtained much better results with OCR (Pytesseract) for PDF content extraction.