Extend the TesseractOCRParser with PDF output
Currently the TesseractOCRParser supports two output formats: plain text and HOCR. The latter was recently added by Eric Pugh (https://github.com/apache/tika/pull/133). Should we add a third output option, 'PDF', which Tesseract already provides?
I am not sure whether it is enough to add the output type, as I did in the TesseractOCRConfig.
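To make the question concrete, here is a minimal, self-contained sketch of what a third output type could look like. It is not the actual Tika code: the class TesseractOutputSketch, the enum OutputType, and the helpers buildCommand and expectedOutputFile are hypothetical names used only for illustration. The only thing taken from Tesseract itself is that the command line accepts the config names txt, hocr, and pdf after the output base name.

```java
import java.util.ArrayList;
import java.util.List;

public class TesseractOutputSketch {

    // Hypothetical enum (not Tika's). Only the Tesseract config names
    // "txt", "hocr" and "pdf" come from Tesseract itself.
    enum OutputType {
        TXT("txt", ".txt"),
        // Assumption: recent Tesseract versions write ".hocr";
        // some older versions wrote ".html" instead.
        HOCR("hocr", ".hocr"),
        PDF("pdf", ".pdf");

        final String configName;
        final String fileExtension;

        OutputType(String configName, String fileExtension) {
            this.configName = configName;
            this.fileExtension = fileExtension;
        }
    }

    // Builds: <tesseract> <inputImage> <outputBase> <configName>
    static List<String> buildCommand(String tesseractPath, String inputImage,
                                     String outputBase, OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add(tesseractPath);
        cmd.add(inputImage);
        cmd.add(outputBase);
        cmd.add(type.configName);
        return cmd;
    }

    // Tesseract appends the extension to the output base name itself.
    static String expectedOutputFile(String outputBase, OutputType type) {
        return outputBase + type.fileExtension;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand("tesseract", "page.png", "out", OutputType.PDF);
        System.out.println(String.join(" ", cmd));
        System.out.println(expectedOutputFile("out", OutputType.PDF));
    }
}
```

The sketch also hints at why adding the enum value alone may not be enough: text and HOCR results can be read back as character data, whereas a PDF result is binary, so the code that consumes the Tesseract output file would presumably need a separate path for it.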
The other discussion point is whether this feature fits the focus of the Tika project. See the discussion here: https://lists.apache.org/thread.html/d1c65367a8bfe13ebc977f6aff8abdfc3e9e09dbce429411dd554840@%3Cuser.tika.apache.org%3E
Hi, I am using tika-app-1.20.jar to parse a PDF, but no content is extracted. Why? The PDF contains only images, and the text is in those images. I assumed Tika would call Tesseract OCR to do the recognition, but that does not seem to happen.