pdfminer.six
New converter for the hOCR format
When text is extracted from many kinds of PDF within a business process, PDFs whose text exists only in image form must be analysed with an OCR tool, which typically outputs hOCR. It would be good to have a PDFMiner converter that extracts the explicit text information from PDFs that do contain it and uses it to generate a basic hOCR representation. That output is designed to be used alongside the image of the PDF in the same way as genuine OCR output, but without the inevitable OCR errors.
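To illustrate the idea, here is a minimal sketch of how text lines and their bounding boxes could be turned into hOCR markup. It is not the implementation mentioned in this issue. The helper names `hocr_bbox`, `lines_to_hocr_page` and `pdf_to_hocr` and the `scale` parameter are hypothetical; only `extract_pages`, `LTTextContainer` and `LTTextLine` come from pdfminer.six. The sketch assumes that page coordinates start at (0, 0) in the bottom-left corner and that one `ocr_line` span per text line is enough, and it emits only a page fragment rather than a complete hOCR document.

```python
from html import escape


def hocr_bbox(bbox, page_height, scale=1.0):
    """Convert a PDF bbox (x0, y0, x1, y1; origin bottom-left) into an
    hOCR bbox string (origin top-left). `scale` is points-to-pixels,
    e.g. dpi / 72 for the rendered page image."""
    x0, y0, x1, y1 = bbox
    left = round(x0 * scale)
    top = round((page_height - y1) * scale)      # flip the y axis
    right = round(x1 * scale)
    bottom = round((page_height - y0) * scale)
    return f"bbox {left} {top} {right} {bottom}"


def lines_to_hocr_page(lines, page_width, page_height, page_no=1, scale=1.0):
    """Build one hOCR page fragment from (text, bbox) pairs."""
    page_box = hocr_bbox((0, 0, page_width, page_height), page_height, scale)
    parts = [f'<div class="ocr_page" id="page_{page_no}" title="{page_box}">']
    for i, (text, bbox) in enumerate(lines, start=1):
        title = hocr_bbox(bbox, page_height, scale)
        parts.append(
            f'  <span class="ocr_line" id="line_{page_no}_{i}" '
            f'title="{title}">{escape(text)}</span>'
        )
    parts.append("</div>")
    return "\n".join(parts)


def pdf_to_hocr(path, scale=1.0):
    """Extract text lines with pdfminer.six and return hOCR page fragments."""
    # Imported here so the helpers above work without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTTextLine

    pages = []
    for page_no, page in enumerate(extract_pages(path), start=1):
        lines = []
        for element in page:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        lines.append((line.get_text().strip(), line.bbox))
        pages.append(
            lines_to_hocr_page(lines, page.width, page.height, page_no, scale)
        )
    return "\n".join(pages)
```

For a page image rendered at 300 dpi, one would call something like `pdf_to_hocr("input.pdf", scale=300 / 72)` (the file name is a placeholder) so that the boxes line up with the image pixels.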
I have already developed a solution and am about to submit a pull request.