Add hOCR output type for pdf2txt

Open hason opened this issue 5 years ago • 5 comments

hOCR is an open standard of data representation for formatted text obtained from optical character recognition (OCR).

hason avatar Jul 13 '19 19:07 hason

Hi @hason. As I understand it, hOCR is HTML/XML with specific requirements. Pdfminer already supports converting to XML and HTML. Do you think we can change the existing implementation to support hOCR HTML?

Also, are you planning on using the hOCR output?

pietermarsman avatar Jul 13 '19 22:07 pietermarsman

@pietermarsman At present, the HTML output is the best representation of the PDF. I think what @hason meant is extracting only the text from the PDF. In that case, the existing HTML converter only shows bounding boxes for words, whereas hOCR would require bounding boxes for characters, lines, and paragraphs.
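To make the bounding-box point concrete, here is a minimal sketch of what emitting an hOCR line could involve. It is not pdfminer.six code: the helper names `to_hocr_bbox` and `hocr_line` are hypothetical, and it assumes the input boxes are `(x0, y0, x1, y1)` in PDF points with the origin at the bottom-left of the page, and that one point maps to one hOCR pixel unless `scale` says otherwise. hOCR boxes are measured from the top-left, so the y-axis has to be flipped.

```python
from html import escape
from typing import List, Tuple

# Assumed input: (x0, y0, x1, y1) in PDF points, origin at the bottom-left.
BBox = Tuple[float, float, float, float]


def to_hocr_bbox(bbox: BBox, page_height: float, scale: float = 1.0) -> str:
    """Flip the y-axis (hOCR measures from the top-left) and round to integers."""
    x0, y0, x1, y1 = bbox
    left, right = round(x0 * scale), round(x1 * scale)
    top = round((page_height - y1) * scale)
    bottom = round((page_height - y0) * scale)
    return f"bbox {left} {top} {right} {bottom}"


def hocr_line(words: List[Tuple[str, BBox]], page_height: float) -> str:
    """Render one text line as an ocr_line span wrapping ocrx_word spans."""
    # The line box is taken as the union of its word boxes.
    line_box = (
        min(b[0] for _, b in words),
        min(b[1] for _, b in words),
        max(b[2] for _, b in words),
        max(b[3] for _, b in words),
    )
    spans = " ".join(
        f"<span class='ocrx_word' title='{to_hocr_bbox(b, page_height)}'>"
        f"{escape(text)}</span>"
        for text, b in words
    )
    return (
        f"<span class='ocr_line' title='{to_hocr_bbox(line_box, page_height)}'>"
        f"{spans}</span>"
    )


# Hypothetical word boxes on a US Letter page (792 points high).
words = [("Hello", (72.0, 700.0, 100.0, 712.0)), ("world", (104.0, 700.0, 130.0, 712.0))]
print(to_hocr_bbox(words[0][1], page_height=792.0))  # → bbox 72 80 100 92
print(hocr_line(words, page_height=792.0))
```

Paragraph and page elements (`ocr_par`, `ocr_page`) would nest around lines in the same way. Character-level boxes are left out of this sketch.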

The advantage of hOCR support is that, within a given processing pipeline, scanned PDFs and searchable PDFs could be handled on par.

@pietermarsman How do we decide whether this feature, or any other feature request, should become part of the library?

fakabbir avatar Aug 15 '19 05:08 fakabbir

Currently, there is no active group of developers implementing new features, so there is no decision process for which features to implement first.

I would like to be part of the "active group of developers" in the future, but for now I am focusing on reviewing existing PRs and fixing bugs.

In an ideal world, people would vote on the features and bugs they find most important, and maintainers would discuss which features to put in the next release.

pietermarsman avatar Aug 19 '19 15:08 pietermarsman

This issue is stale. Is this still something we are potentially interested in adding? I don't know how much has changed in the last year or so regarding the decision processes etc. Moving this issue to "needs solution" so we can continue the discussion.

jstockwin avatar Jul 09 '20 14:07 jstockwin

I'm not sure how "big" hOCR is. If it is the go-to standard, it would be interesting to implement. It would allow comparing OCR techniques with pdfminer.six more directly.

We can move this issue forward by researching OCR output formats and which ones are commonly used.

pietermarsman avatar Jul 11 '20 09:07 pietermarsman