New converter for the hOCR format

Open richardpaulhudson opened this issue 3 years ago • 0 comments

When a business process extracts text from many kinds of PDF, those PDFs whose text exists only as images have to be run through an OCR tool, which will typically output hOCR. It would be good to have a PDFMiner converter for the PDFs that do carry explicit text information: it would extract that text and generate a basic hOCR representation from it. The result is designed to be used together with the image of the PDF in the same way as genuine OCR output, but without the inevitable OCR errors.
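To illustrate the idea, here is a minimal sketch of the core conversion step. It is not the solution referred to in this issue and not part of the pdfminer.six API. The function names `pdf_bbox_to_hocr` and `lines_to_hocr` and the `dpi` parameter are hypothetical. The sketch assumes the text lines have already been collected as `(text, bbox)` pairs in PDF points with a bottom-left origin, which is the form pdfminer.six layout objects expose through `get_text()` and `.bbox`. hOCR bounding boxes use a top-left origin in image pixels, so the y axis has to be flipped and the values scaled to the resolution of the page image.

```python
from html import escape


def pdf_bbox_to_hocr(bbox, page_height, scale=1.0):
    """Convert a PDF bbox (bottom-left origin, points) to hOCR (top-left origin, pixels)."""
    x0, y0, x1, y1 = bbox
    return (
        round(x0 * scale),
        round((page_height - y1) * scale),  # top edge after flipping the y axis
        round(x1 * scale),
        round((page_height - y0) * scale),  # bottom edge after flipping the y axis
    )


def lines_to_hocr(page_width, page_height, lines, dpi=72):
    """Build a basic hOCR page from (text, bbox) pairs given in PDF points.

    `dpi` is the assumed resolution of the page image the hOCR will be
    paired with; PDF points are 1/72 inch, hence the scale factor.
    """
    scale = dpi / 72.0
    page_box = (0, 0, round(page_width * scale), round(page_height * scale))
    out = [
        '<div class="ocr_page" title="bbox %d %d %d %d">' % page_box,
    ]
    for text, bbox in lines:
        box = pdf_bbox_to_hocr(bbox, page_height, scale)
        out.append(
            '  <span class="ocr_line" title="bbox %d %d %d %d">%s</span>'
            % (box + (escape(text.strip()),))
        )
    out.append("</div>")
    return "\n".join(out)


if __name__ == "__main__":
    # Hypothetical input: one line of text on a US Letter page (612 x 792 points).
    sample = [("Invoice & receipt", (72, 700, 300, 712))]
    print(lines_to_hocr(612, 792, sample, dpi=144))
```

In practice the `(text, bbox)` pairs could be gathered by iterating over the `LTTextLine` objects yielded by `pdfminer.high_level.extract_pages`. A real converter would also need word-level `ocrx_word` elements and the surrounding HTML document header, which this sketch leaves out.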

I have already developed a solution and am about to submit a pull request.

richardpaulhudson, Jul 29 '21 20:07