Extend the TesseractOCRParser with PDF output

Open rsoika opened this issue 6 years ago • 1 comment

Currently the TesseractOCRParser supports two output formats: plain text and HOCR. The second was recently added by Eric Pugh (https://github.com/apache/tika/pull/133). My question is whether we should add a third output option, 'PDF', which Tesseract also provides.

I am not sure whether it is enough to add the output type to the TesseractOCRConfig, as I did.
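
To make the idea concrete, here is a minimal sketch of how an output type could be mapped onto the Tesseract command line. It relies on the fact that the `tesseract` CLI accepts config names such as `hocr` and `pdf` after the output base. The class `TesseractCommandSketch`, the enum `OutputType`, and the method `buildCommand` are hypothetical names used for illustration only; they are not the actual Tika code.

```java
import java.util.ArrayList;
import java.util.List;

public class TesseractCommandSketch {

    // Hypothetical enum for illustration; not Tika's actual API.
    enum OutputType {
        TXT(null),      // plain text is Tesseract's default, so no config name is needed
        HOCR("hocr"),
        PDF("pdf");

        private final String configName;

        OutputType(String configName) {
            this.configName = configName;
        }

        String configName() {
            return configName;
        }
    }

    // Builds: tesseract <input> <outputBase> -l <language> [configName]
    static List<String> buildCommand(String tesseractPath, String input,
                                     String outputBase, String language,
                                     OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractPath);
        cmd.add(input);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(language);
        if (type.configName() != null) {
            // The config name selects the renderer (hOCR or PDF).
            cmd.add(type.configName());
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("tesseract", "scan.png", "out", "eng", OutputType.PDF);
        System.out.println(String.join(" ", cmd));
    }
}
```

Passing the extra argument is probably the easy part. With `pdf`, Tesseract writes a binary PDF file rather than text, so whatever reads the result back would likely need to handle that case separately.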

The other discussion point is whether this feature fits the focus of the Tika project. See the discussion here: https://lists.apache.org/thread.html/d1c65367a8bfe13ebc977f6aff8abdfc3e9e09dbce429411dd554840@%3Cuser.tika.apache.org%3E

Apr 26 '19 06:04 rsoika

Hi, I am using tika-app-1.20.jar to parse a PDF, but it does not extract any content. The PDF contains only images, and the text is on those images. I assumed Tika would call Tesseract OCR to do the recognition, but it does not appear to do so. Why?

May 08 '19 09:05 changetoblow