
Feature request: Processing of non-PDF sources

Open bland328 opened this issue 5 years ago • 2 comments

I realize this may be thoroughly outside the intended scope of this project, but it would be wonderful if it could process not just PDF files but also a variety of image files (tiff and jpg come to mind). Perhaps by passing them directly to tesseract-ocr and outputting the results as text files?

Thanks for the fantastic Unraid docker container, and for your consideration!

bland328, Jun 09 '20 15:06

It could work almost out of the box. According to the docs, ocrmypdf can process images: https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#option-use-ocrmypdf-single-images-only

ocrmypdf-auto only needs to allow the extension to be processed. I added jpg to the extension list, but ocrmypdf failed on a picture I took with my phone due to invalid DPI values. I did not try it out further.
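For the invalid-DPI failure, ocrmypdf has an `--image-dpi` option that supplies a resolution when the image's own metadata is missing or unusable. A minimal sketch of calling it, assuming `ocrmypdf` is on the PATH and that 300 DPI is an acceptable guess for a phone photo; the helper names and file names are hypothetical and not part of ocrmypdf-auto:

```python
import subprocess
from pathlib import Path


def ocr_image_command(image: Path, output_pdf: Path, dpi: int = 300) -> list[str]:
    """Build an ocrmypdf command line that overrides the image DPI.

    The 300 DPI default is an assumption, not a value from this thread.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return ["ocrmypdf", "--image-dpi", str(dpi), str(image), str(output_pdf)]


def ocr_image(image: Path, output_pdf: Path, dpi: int = 300) -> None:
    """Run ocrmypdf on one image; raises CalledProcessError if it fails."""
    subprocess.run(ocr_image_command(image, output_pdf, dpi), check=True)
```

Something like `ocr_image(Path("photo.jpg"), Path("photo.pdf"))` would then be the call to try. Whether a guessed DPI gives good OCR results on a given photo is untested here.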

The way to go would probably be to use img2pdf for images and then feed them to ocrmypdf.
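A rough sketch of that two-step idea, assuming the `img2pdf` and `ocrmypdf` command-line tools are both installed. The extension set, the intermediate file naming, and the function names are all hypothetical; this is not how ocrmypdf-auto is implemented:

```python
import subprocess
from pathlib import Path

# Assumed set of image types to route through img2pdf first.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".tif", ".tiff"}


def plan_commands(source: Path, output_pdf: Path) -> list[list[str]]:
    """Return the commands needed to turn `source` into an OCRed PDF."""
    suffix = source.suffix.lower()
    if suffix == ".pdf":
        # PDFs go straight to ocrmypdf.
        return [["ocrmypdf", str(source), str(output_pdf)]]
    if suffix in IMAGE_EXTENSIONS:
        # Images are wrapped in a PDF by img2pdf, then OCRed.
        intermediate = output_pdf.with_name(output_pdf.stem + ".img2pdf.pdf")
        return [
            ["img2pdf", str(source), "-o", str(intermediate)],
            ["ocrmypdf", str(intermediate), str(output_pdf)],
        ]
    raise ValueError(f"unsupported extension: {source.suffix!r}")


def process(source: Path, output_pdf: Path) -> None:
    """Run each planned command in order, stopping at the first failure."""
    for command in plan_commands(source, output_pdf):
        subprocess.run(command, check=True)
```

Keeping the planning separate from the `subprocess` calls makes the routing easy to check without either tool installed. Cleanup of the intermediate PDF is left out of the sketch.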

jo-me, Nov 16 '20 17:11

An initial step toward this support is now available with @jo-me's latest updates. The Docker image now accepts the .jpg extension and passes jpg files directly to ocrmypdf, though from @jo-me's experiments, it sounds like this is not sufficient for proper OCR in all cases.

I will keep this issue open to track the feature request and see whether it is reasonable to add img2pdf preprocessing in the container in a future update.

cmccambridge, Nov 20 '20 03:11