OCRmyPDF
support converting multiple images
Currently, you need to either use tesseract directly, obsoleting OCRmyPDF, or go through img2pdf to be able to use a set of scans in image format, which is quite common imo. See: https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#ocr-images-not-pdfs
It would be nice if OCRmyPDF supported this out of the box, to be a one-stop solution.
Generally I've been against this because of the Unix philosophy of orthogonal and composable tools, i.e., img2pdf does a great job of combining images and you can pipe that to ocrmypdf. That's not to say that I can't be talked out of my position if there's a compelling argument.
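For reference, the composable approach can be sketched in Python as a thin wrapper around the two command-line tools. This is only an illustration, assuming `img2pdf` and `ocrmypdf` are installed and on `PATH`; the helper names `build_pipeline` and `run_pipeline` are hypothetical and not part of either project.

```python
import subprocess


def build_pipeline(images, output_pdf):
    """Return the argv lists for: img2pdf IMAGES... | ocrmypdf - OUTPUT."""
    if not images:
        raise ValueError("at least one input image is required")
    # With no output option, img2pdf writes the combined PDF to stdout.
    img2pdf_cmd = ["img2pdf", *images]
    # "-" tells ocrmypdf to read the input PDF from stdin.
    ocrmypdf_cmd = ["ocrmypdf", "-", output_pdf]
    return img2pdf_cmd, ocrmypdf_cmd


def run_pipeline(images, output_pdf):
    """Run the two tools as a pipe (assumes both are on PATH)."""
    img2pdf_cmd, ocrmypdf_cmd = build_pipeline(images, output_pdf)
    producer = subprocess.Popen(img2pdf_cmd, stdout=subprocess.PIPE)
    try:
        subprocess.run(ocrmypdf_cmd, stdin=producer.stdout, check=True)
    finally:
        producer.stdout.close()
        producer.wait()
    if producer.returncode != 0:
        raise subprocess.CalledProcessError(producer.returncode, img2pdf_cmd)
```

Calling `run_pipeline(["scan1.jpg", "scan2.jpg"], "scans.pdf")` would then be roughly equivalent to the shell pipe described in the cookbook.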
At the moment I see some design problems with adding this functionality:
- Under the current syntax if you forget the output filename, you'd overwrite the last input file
- There would be pressure to deal with multipage TIFFs and PDFs as input sources
- The one to one page mapping between input and output would be lost - I can't tell you there's an error on page 7 any more, I have to tell you "file 2 page 3" or something.
- It's not obvious how to merge metadata coming from multiple PDFs or images
- I would mainly be wrapping img2pdf but hiding most of its features
- Multiple images may need different DPI overrides to present correctly
- Creating an intermediate PDF with all desired content encourages the user to inspect the file before OCR
- Generally commands that combine multiple inputs into one output use the `command -o output input1 input2` syntax, which I historically haven't used. `command in1 in2 out` is rare (notable exception: `cp`).
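To make the page mapping point concrete, here is a minimal sketch of the bookkeeping needed to translate a page number in the combined output back to an input file and a page within it. The function name `locate_page` and the example page counts are made up for illustration; nothing like this exists in OCRmyPDF.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_page(page_counts, output_page):
    """Map a 1-based output page to (file_number, page_in_file), both 1-based.

    page_counts[i] is the number of pages contributed by input file i.
    """
    ends = list(accumulate(page_counts))  # last output page of each input
    total = ends[-1] if ends else 0
    if not 1 <= output_page <= total:
        raise IndexError(f"page {output_page} is outside 1..{total}")
    # First input whose cumulative end is at or beyond the requested page.
    file_idx = bisect_right(ends, output_page - 1)
    pages_before = ends[file_idx] - page_counts[file_idx]
    return file_idx + 1, output_page - pages_before


# With inputs of 4, 5 and 2 pages, output page 7 is page 3 of file 2.
print(locate_page([4, 5, 2], 7))
```

Every error message and progress report would need to go through a translation like this once inputs stop being a single PDF.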
Note it doesn't obsolete ocrmypdf - Tesseract can't do PDF to OCR PDF, which is the main use case of OCRmyPDF, and it can't preprocess, produce PDF/A or optimize images. If all you have are singleton images and you just want them combined into an OCR PDF, Tesseract does fine.
Maybe a second command line program `ocrmyimages` could be added, using syntax like `ocrmyimages -o output.pdf image1.jpg image2.png` to address the input file clobbering issue.
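A rough sketch of how that syntax avoids clobbering, using Python's standard `argparse`. The `ocrmyimages` program is only the hypothetical command proposed above, and `build_parser` is an illustrative name; the point is that a required `-o` means a forgotten output filename becomes a usage error instead of an overwritten input.

```python
import argparse


def build_parser():
    """Argument parser for the hypothetical `ocrmyimages` command."""
    parser = argparse.ArgumentParser(prog="ocrmyimages")
    # The output is an explicit, required option, so no positional argument
    # can ever be mistaken for the output file.
    parser.add_argument("-o", "--output", required=True,
                        help="path of the OCR PDF to create")
    parser.add_argument("images", nargs="+",
                        help="one or more input image files, in page order")
    return parser


args = build_parser().parse_args(
    ["-o", "output.pdf", "image1.jpg", "image2.png"]
)
print(args.output, args.images)
```

Running the same parser on `["image1.jpg", "image2.png"]` exits with a usage error because `-o` is missing, rather than treating `image2.png` as the output.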
Thanks for the detailed analysis @jbarlow83! Now I understand better where this comes from. I think your suggestion would work well and is a clean solution in the Unix philosophy...