OCRmyPDF support converting multiple images

Currently, to OCR a set of scans in image format (which is quite common imo), you either need to use tesseract directly, which makes OCRmyPDF redundant, or go through img2pdf first, see: https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#ocr-images-not-pdfs

It would be nice if OCRmyPDF supported this out of the box, to be a one-stop solution.

Oct 21 '19 15:10 grexe

Generally I've been against this because of the Unix philosophy of orthogonal and composable tools, i.e., img2pdf does a great job of combining images and you can pipe its output to ocrmypdf. That's not to say I can't be talked out of my position if there's a compelling argument.
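
For readers who want to script the composable route, here is a minimal Python sketch of the `img2pdf ... | ocrmypdf - out.pdf` pipeline from the cookbook linked above. It assumes both tools are installed and on the PATH; the helper names `build_pipeline` and `run_pipeline` are made up for illustration and are not part of either project.

```python
import subprocess


def build_pipeline(images, output_pdf):
    """Return the two argv lists for `img2pdf ... | ocrmypdf - output_pdf`."""
    if not images:
        raise ValueError("at least one input image is required")
    # img2pdf writes the combined PDF to stdout when no output file is given.
    img2pdf_cmd = ["img2pdf", *[str(p) for p in images]]
    # ocrmypdf reads the input PDF from stdin when the input file is "-".
    ocrmypdf_cmd = ["ocrmypdf", "-", str(output_pdf)]
    return img2pdf_cmd, ocrmypdf_cmd


def run_pipeline(images, output_pdf):
    """Run both tools, connecting img2pdf's stdout to ocrmypdf's stdin."""
    img2pdf_cmd, ocrmypdf_cmd = build_pipeline(images, output_pdf)
    producer = subprocess.Popen(img2pdf_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(ocrmypdf_cmd, stdin=producer.stdout)
    # Close our copy of the pipe so img2pdf sees a broken pipe
    # if ocrmypdf exits early.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    if producer_rc != 0 or consumer_rc != 0:
        raise RuntimeError(
            f"pipeline failed: img2pdf={producer_rc}, ocrmypdf={consumer_rc}"
        )
```

Something like `run_pipeline(["scan1.png", "scan2.png"], "scans.pdf")` would then produce one OCRed PDF, with the pages in the order the images are listed.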

At the moment I see some design problems with adding this functionality:

- Under the current syntax, if you forget the output filename, you'd overwrite the last input file.
- There would be pressure to deal with multipage TIFFs and PDFs as input sources.
- The one-to-one page mapping between input and output would be lost - I can't tell you there's an error on page 7 any more, I have to tell you "file 2 page 3" or something.
- It's not obvious how to merge metadata coming from multiple PDFs or images.
- I would mainly be wrapping img2pdf but hiding most of its features.
- Multiple images may need different DPI overrides to present correctly.
- Creating an intermediate PDF with all desired content encourages the user to inspect the file before OCR.
- Generally, commands that combine multiple inputs into one output use the command -o output input1 input2 syntax, which I historically haven't used. command in1 in2 out is rare (notable exception: cp).

Note that using Tesseract directly doesn't obsolete ocrmypdf: Tesseract can't do PDF to OCR PDF, which is the main use case of OCRmyPDF, and it can't preprocess, produce PDF/A or optimize images. If all you have are singleton images and you just want them combined into an OCR PDF, Tesseract does fine.

Oct 21 '19 20:10 jbarlow83

Maybe a second command line program ocrmyimages could be added, using syntax like ocrmyimages -o output.pdf image1.jpg image2.png to address the input file clobbering issue.
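
A rough Python sketch of how that command line could be parsed, just to illustrate why a mandatory `-o` avoids the clobbering problem. `ocrmyimages` is only the name proposed in this thread, not an existing program, and the `--output` long option and the `parse_args` helper are assumptions added for the example.

```python
import argparse


def parse_args(argv):
    """Parse `ocrmyimages -o output.pdf image1.jpg image2.png ...`."""
    # "ocrmyimages" is the hypothetical program name proposed above.
    parser = argparse.ArgumentParser(prog="ocrmyimages")
    # Making -o mandatory means a forgotten output name is a usage error
    # rather than a silent overwrite of the last input file.
    parser.add_argument("-o", "--output", required=True, help="output PDF")
    parser.add_argument("images", nargs="+", help="input images, in page order")
    args = parser.parse_args(argv)
    # Extra guard: refuse to write the output on top of one of the inputs.
    if args.output in args.images:
        parser.error("output file must not be one of the input images")
    return args
```

With this layout, `ocrmyimages image1.jpg image2.png` (no `-o`) exits with a usage error, whereas under a `command in1 in2 out` syntax the same mistake would treat `image2.png` as the output.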

Nov 01 '19 22:11 jbarlow83

Thanks for the detailed analysis @jbarlow83 ! Now I understand better where this comes from. I think your suggestion would work well and is a clean solution in the Unix philosophy...

Nov 01 '19 22:11 grexe
