pdfminer.six Add HOCRConverter (fixes #650)

Pull request

Fix https://github.com/pdfminer/pdfminer.six/issues/650 Fix https://github.com/pdfminer/pdfminer.six/issues/265

Where text is being extracted from a variety of types of PDF within a business process, those PDFs where the text is only present in image form will need to be analysed using an OCR tool, which will typically output hOCR. This converter extracts the explicit text information from those PDFs that do have it and uses it to generate a basic hOCR representation that is designed to be used in conjunction with the image of the PDF in the same way as genuine OCR output would be, but without the inevitable OCR errors.
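To illustrate what "a basic hOCR representation" of explicit PDF text involves, here is a minimal sketch. It is not the converter's actual code: the helper names `pdf_bbox_to_hocr` and `hocr_word` are hypothetical, and it assumes one PDF point maps to one image pixel. The key step is flipping the y-axis, because PDF coordinates have their origin at the bottom-left of the page while hOCR bounding boxes are measured from the top-left of the image.

```python
from html import escape


def pdf_bbox_to_hocr(bbox, page_height):
    """Convert a PDF bbox (origin bottom-left) to hOCR order (origin top-left).

    Hypothetical helper; assumes 1 PDF point == 1 image pixel.
    """
    x0, y0, x1, y1 = bbox
    # The top edge in PDF space (y1) becomes the smaller y value in hOCR space.
    return (round(x0), round(page_height - y1), round(x1), round(page_height - y0))


def hocr_word(text, bbox, page_height):
    """Render one word as an hOCR 'ocrx_word' span with a bbox in its title."""
    x0, y0, x1, y1 = pdf_bbox_to_hocr(bbox, page_height)
    return (
        f"<span class='ocrx_word' title='bbox {x0} {y0} {x1} {y1}'>"
        f"{escape(text)}</span>"
    )


# A word near the top of a US Letter page (792 points high).
print(hocr_word("Hello", (100, 700, 150, 712), 792))
```

A real converter would also need to scale the coordinates to the resolution of the page image it is paired with.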

How Has This Been Tested?

from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams
layout = LAParams(all_texts=True)
extract_text_to_fp(in_file, out_file, output_type='hocr', laparams=layout)

tox also runs with Python 3.8 and 3.9.

Checklist

[x] I have added tests that prove my fix is effective or that my feature works
[x] I have added docstrings to newly created methods and classes
[x] I have optimized the code at least one time after creating the initial version
[x] I have updated the README.md or I verified that this is not necessary
[x] I have updated the readthedocs documentation or I verified that this is not necessary
[x] I have added a concise human-readable description of the change to CHANGELOG.md

Jul 29 '21 20:07 richardpaulhudson

Would be amazing if this could be merged and included!

Dec 02 '21 12:12 willaaam

Looks good to me.

I only wonder if this is something that should be added to pdfminer.six as core functionality. Alternatively, this could be something that everyone implements to their own liking. The composable api is perfectly suitable for adding functionality like this.

I'll post this question on the gitter.

Jan 25 '22 20:01 pietermarsman

After some deliberation I'm positive on adding hOCR as an output format. It has two advantages: the output can be compared directly with that of OCR tools, and other tools built for hOCR (e.g. for visualization) can be used on it.

I'll do a more detailed review now.

Jan 30 '22 14:01 pietermarsman

@richardpaulhudson I used this PR a bit for testing if the new CI pipeline is functioning properly. Now it is :)

Feb 02 '22 21:02 pietermarsman

@richardpaulhudson any plans on working on this in the future?

Feb 22 '22 20:02 pietermarsman

Hi @pietermarsman, thank you for the review and sorry for not responding sooner. I've changed employers in the meantime and there seem to be issues with where my GitHub notification mails are ending up. I hope to be able to pick up working on this in the next couple of months.

Mar 11 '22 09:03 richardpaulhudson

FYI, I've changed this MR to merge into master. The develop branch will be removed, because soon we will work with version tags to indicate the releases and the distinction between develop and master becomes obsolete.

Mar 19 '22 16:03 pietermarsman

bump ;)

Jun 25 '22 20:06 pietermarsman

Sorry it's taken me so long to get back to this :-)

Can you add tests showing this works. Ideally you would use the simple1.pdf for this.

I can certainly see the need for some sort of regression test, but am unsure how to approach it. What I actually did myself was:

- checked the hOCR output passed hocr-check (from the hocr-tools package)
- commented in hocrjs and checked the rendering of the content in the browser corresponded to the original PDF file

neither of which lends itself easily to a regression test.

The options are:

1. a regression test that just checks the conversion is carried out successfully without an error
2. a regression test that checks the output of the conversion is equal to the output of my conversion which I have verified with the two steps above. Issues with this are:
   - there may be problems with the output that I'm not aware of because they weren't picked up by these two steps, but such a test would declare the output to be correct
   - tests comparing large amounts of output at once tend to be brittle
3. a regression test that checks the output of the conversion for specific features, although I'm unsure what these would be
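As one possible reading of the last option (checking for specific features), a test could look for structural properties of the hOCR instead of comparing whole files. The sketch below uses only the standard library; `HocrFeatureCollector` and `check_hocr_features` are hypothetical names, and the chosen features (an `ocr_page` element is present, every `bbox` is well-formed) are assumptions about what would be worth asserting, not an agreed test design.

```python
import re
from html.parser import HTMLParser

# Matches the 'bbox x0 y0 x1 y1' property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrFeatureCollector(HTMLParser):
    """Collect class names and bounding boxes from hOCR markup."""

    def __init__(self):
        super().__init__()
        self.classes = []
        self.bboxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "class" in attrs:
            self.classes.append(attrs["class"])
        match = BBOX_RE.search(attrs.get("title") or "")
        if match:
            self.bboxes.append(tuple(int(g) for g in match.groups()))


def check_hocr_features(hocr_text):
    """Return (has_page, number_of_bboxes, all_bboxes_well_formed)."""
    collector = HocrFeatureCollector()
    collector.feed(hocr_text)
    has_page = "ocr_page" in collector.classes
    boxes_ok = all(x0 <= x1 and y0 <= y1 for x0, y0, x1, y1 in collector.bboxes)
    return has_page, len(collector.bboxes), boxes_ok


# Hand-written sample standing in for real converter output.
sample = (
    "<div class='ocr_page' title='bbox 0 0 612 792'>"
    "<span class='ocrx_word' title='bbox 100 80 150 92'>Hello</span>"
    "</div>"
)
print(check_hocr_features(sample))
```

Checks of this kind are less brittle than a full-output comparison, at the cost of not catching changes in the extracted text itself.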

Jul 13 '22 14:07 richardpaulhudson

I prefer option 1 (just checking if the code does not raise an error) or 2 (check for specific output). If you go for two, we do indeed need to have some output that we know is reasonably stable.

Having a test with output (option 2) is also a start on some documentation, as other developers can easily see what the tool's expected output is.

Aug 08 '22 20:08 pietermarsman

@richardpaulhudson Thanks for all your work!

Aug 14 '22 09:08 pietermarsman
