pdfminer.six
Add HOCRConverter (fixes #650)
Pull request
Fix https://github.com/pdfminer/pdfminer.six/issues/650 Fix https://github.com/pdfminer/pdfminer.six/issues/265
When text is extracted from many kinds of PDF within a business process, the PDFs whose text is present only in image form need to be analysed with an OCR tool, which will typically output hOCR. This converter extracts the explicit text information from the PDFs that do have it and uses it to generate a basic hOCR representation. That representation is designed to be used together with the image of the PDF in the same way as genuine OCR output, but without the inevitable OCR errors.
How Has This Been Tested?
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams
layout = LAParams(all_texts=True)
extract_text_to_fp(in_file, out_file, output_type='hocr', laparams=layout)
tox also runs with Python 3.8 and 3.9.
Checklist
- [x] I have added tests that prove my fix is effective or that my feature works
- [x] I have added docstrings to newly created methods and classes
- [x] I have optimized the code at least one time after creating the initial version
- [x] I have updated the README.md or I have verified that this is not necessary
- [x] I have updated the readthedocs documentation or I have verified that this is not necessary
- [x] I have added a concise human-readable description of the change to CHANGELOG.md
Would be amazing if this could be merged and included!
Looks good to me.
I only wonder whether this should be added to pdfminer.six as core functionality. Alternatively, this could be something that everyone implements to their own liking. The composable API is perfectly suitable for adding functionality like this.
I'll post this question on the gitter.
After some deliberation I'm positive about adding hOCR as an output format. It has two advantages: the output can be compared directly with that of OCR tools, and other tools built for hOCR (e.g. for visualization) can be used with it.
I'll do a more detailed review now.
@richardpaulhudson I used this PR a bit for testing if the new CI pipeline is functioning properly. Now it is :)
@richardpaulhudson any plans on working on this in the future?
Hi @pietermarsman, thank you for the review and sorry for not responding sooner — I've changed employers in the meantime and there seem to be issues with where my GitHub notification mails are ending up. I hope to be able to pick up working on this in the next couple of months.
FYI, I've changed this MR to merge into master. The develop branch will be removed, because soon we will work with version tags to indicate the releases and the distinction between develop and master becomes obsolete.
bump ;)
Sorry it's taken me so long to get back to this :-)
Can you add tests showing that this works? Ideally you would use simple1.pdf for this.
I can certainly see the need for some sort of regression test, but am unsure how to approach it. What I actually did myself was:
- checked the hOCR output passed hocr-check (from the hocr-tools package)
- commented in hocrjs and checked the rendering of the content in the browser corresponded to the original PDF file
neither of which lends itself easily to a regression test.
The options are:
- a regression test that just checks the conversion is carried out successfully without an error
- a regression test that checks the output of the conversion is equal to the output of my conversion which I have verified with the two steps above. Issues with this are:
  - there may be problems with the output that I'm not aware of because they weren't picked up by these two steps, but such a test would declare the output to be correct
  - tests comparing large amounts of output at once tend to be brittle
- a regression test that checks the output of the conversion for specific features, although I'm unsure what these would be
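For the last option, one possible set of features is the presence of the standard hOCR element classes, well-formed bounding boxes, and the recovered word text. The sketch below is only an illustration under assumptions: the sample markup is written by hand in the style of hOCR (class names such as ocr_page, ocr_line and ocrx_word, with bbox values in the title attribute) and is not actual output of this converter, and HocrFeatureCollector and collect_hocr_features are hypothetical helper names. It uses only the Python standard library.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, written by hand for illustration only;
# it is NOT actual output of the converter discussed in this PR.
SAMPLE_HOCR = """<html><body>
<div class='ocr_page' id='page_1' title='bbox 0 0 612 792'>
<div class='ocr_block' title='bbox 100 100 300 120'>
<span class='ocr_line' title='bbox 100 100 300 120'>
<span class='ocrx_word' title='bbox 100 100 150 120'>Hello</span>
<span class='ocrx_word' title='bbox 160 100 300 120'>World</span>
</span>
</div>
</div>
</body></html>"""


class HocrFeatureCollector(HTMLParser):
    """Collect class counts, bounding boxes and word text from hOCR markup."""

    def __init__(self):
        super().__init__()
        self.class_counts = {}  # class name -> number of elements
        self.bboxes = []        # (x0, y0, x1, y1) tuples from title attributes
        self.words = []         # text found directly inside ocrx_word elements
        self._stack = []        # class of each currently open element

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        css_class = attr_map.get("class") or ""
        self._stack.append(css_class)
        if css_class:
            self.class_counts[css_class] = self.class_counts.get(css_class, 0) + 1
        # The title attribute holds ';'-separated properties, e.g. "bbox 0 0 612 792".
        for prop in (attr_map.get("title") or "").split(";"):
            parts = prop.split()
            if parts and parts[0] == "bbox":
                self.bboxes.append(tuple(float(p) for p in parts[1:5]))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._stack and self._stack[-1] == "ocrx_word" and data.strip():
            self.words.append(data.strip())


def collect_hocr_features(hocr_text):
    """Parse hOCR text and return the populated collector."""
    collector = HocrFeatureCollector()
    collector.feed(hocr_text)
    collector.close()
    return collector


features = collect_hocr_features(SAMPLE_HOCR)
assert features.class_counts.get("ocr_page") == 1
assert all(x0 <= x1 and y0 <= y1 for x0, y0, x1, y1 in features.bboxes)
assert features.words == ["Hello", "World"]
```

In a real regression test the sample string would be replaced by the converter's output for a small fixture such as simple1.pdf, and the expected words would be chosen from that file. Checks of this kind are less brittle than comparing the whole output, at the cost of catching fewer kinds of regression.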
I prefer option 1 (just checking if the code does not raise an error) or 2 (check for specific output). If you go for two, we do indeed need to have some output that we know is reasonably stable.
Having a test with output (option 2) is also a start on some documentation, as other developers can easily see what the expected output of the tool is.
@richardpaulhudson Thanks for all your work!