support converting multiple images

Open grexe opened this issue 5 years ago • 3 comments

Currently, to OCR a set of scans in image format (which is quite common imo), you need to either use Tesseract directly, which makes OCRmyPDF obsolete, or go through img2pdf first; see: https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#ocr-images-not-pdfs

It would be nice if OCRmyPDF supported this out of the box, to be a one-stop solution.

grexe, Oct 21 '19 15:10

Generally I've been against this because of the Unix philosophy of orthogonal and composable tools, i.e., img2pdf does a great job of combining images and you can pipe that to ocrmypdf. That's not to say that I can't be talked out of my position if there's a compelling argument.
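For readers who want to script the img2pdf-to-ocrmypdf pipe described above, here is a minimal sketch in Python. It assumes both `img2pdf` and `ocrmypdf` are installed and on `PATH`, and follows the cookbook pattern `img2pdf IMAGES... | ocrmypdf - OUTPUT`. The function names `build_pipeline`, `as_shell` and `run_pipeline` are made up for illustration and are not part of either tool.

```python
import shlex
import subprocess


def build_pipeline(images, output_pdf):
    """Return the two argv lists for `img2pdf IMAGES... | ocrmypdf - OUTPUT`."""
    if not images:
        raise ValueError("at least one input image is required")
    combine = ["img2pdf", *images]       # with no -o, img2pdf writes the PDF to stdout
    ocr = ["ocrmypdf", "-", output_pdf]  # "-" makes ocrmypdf read the PDF from stdin
    return combine, ocr


def as_shell(images, output_pdf):
    """Render the pipeline as a shell command line, quoting file names safely."""
    combine, ocr = build_pipeline(images, output_pdf)
    return f"{shlex.join(combine)} | {shlex.join(ocr)}"


def run_pipeline(images, output_pdf):
    """Run both programs connected by a pipe; return the first nonzero exit code, else 0."""
    combine, ocr = build_pipeline(images, output_pdf)
    producer = subprocess.Popen(combine, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(ocr, stdin=producer.stdout)
    producer.stdout.close()  # so img2pdf gets SIGPIPE if ocrmypdf exits early
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    return producer_rc or consumer_rc
```

For example, `as_shell(["page1.png", "page2.png"], "scans.pdf")` gives the command line you would type by hand, and `run_pipeline` executes the same thing without going through a shell.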

At the moment I see some design problems with adding this functionality:

  • Under the current syntax if you forget the output filename, you'd overwrite the last input file
  • There would be pressure to deal with multipage TIFFs and PDFs as input sources
  • The one to one page mapping between input and output would be lost - I can't tell you there's an error on page 7 any more, I have to tell you "file 2 page 3" or something.
  • It's not obvious how to merge metadata coming from multiple PDFs or images
  • I would mainly be wrapping img2pdf but hiding most of its features
  • Multiple images may need different DPI overrides to present correctly
  • Creating an intermediate PDF with all desired content encourages the user to inspect the file before OCR
  • Generally commands that combine multiple inputs into one output use the command -o output input1 input2 syntax, which I historically haven't used. command in1 in2 out is rare (notable exception: cp).
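To make the page-mapping point concrete, here is a rough sketch (not anything OCRmyPDF implements) of the bookkeeping needed to turn an output page number back into "file N page M" once several inputs are concatenated. The function name `locate_page` and the `page_counts` list are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def locate_page(page_counts, output_page):
    """Map a 1-based output page number to (1-based file number, 1-based page in that file).

    page_counts[i] is the number of pages contributed by input file i, in order.
    """
    ends = list(accumulate(page_counts))  # last output page belonging to each file
    total = ends[-1] if ends else 0
    if not 1 <= output_page <= total:
        raise IndexError(f"output page {output_page} is outside 1..{total}")
    index = bisect_left(ends, output_page)    # first file whose range reaches this page
    start = ends[index] - page_counts[index]  # output pages that precede this file
    return index + 1, output_page - start


# Two inputs with 4 and 5 pages: output page 7 is "file 2 page 3".
print(locate_page([4, 5], 7))
```

Every error message, log line and page-range option would need this translation, which is part of why the one-to-one mapping is worth keeping.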

Note that it doesn't make ocrmypdf obsolete - Tesseract can't turn a PDF into an OCR PDF, which is the main use case of OCRmyPDF, and it can't preprocess, produce PDF/A or optimize images. If all you have are singleton images and you just want them combined into an OCR PDF, Tesseract does fine.

jbarlow83, Oct 21 '19 20:10

Maybe a second command line program ocrmyimages could be added, using syntax like ocrmyimages -o output.pdf image1.jpg image2.png to address the input file clobbering issue.
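A minimal sketch of what that command line could look like with Python's `argparse`, assuming the hypothetical `ocrmyimages` program proposed above (it does not exist; the `parse_args` helper and the extra same-name check are illustrative assumptions). Making `-o` mandatory means a forgotten output name is a usage error rather than an overwritten input.

```python
import argparse


def build_parser():
    """Parser for the proposed `ocrmyimages -o OUTPUT.pdf IMAGE [IMAGE ...]` syntax."""
    parser = argparse.ArgumentParser(prog="ocrmyimages")
    parser.add_argument("-o", "--output", required=True, metavar="OUTPUT.pdf",
                        help="PDF to write; required, so it can never be inferred from the inputs")
    parser.add_argument("images", nargs="+", metavar="IMAGE",
                        help="one or more input images, in page order")
    return parser


def parse_args(argv):
    """Parse argv and reject an output path that is also listed as an input."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.output in args.images:
        parser.error("output file must not be one of the input images")
    return args


args = parse_args(["-o", "output.pdf", "image1.jpg", "image2.png"])
print(args.output, args.images)
```

Calling `parse_args(["image1.jpg", "image2.png"])` exits with a usage error because `-o` is missing, instead of treating the last image as the output.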

jbarlow83, Nov 01 '19 22:11

Thanks for the detailed analysis, @jbarlow83! Now I understand better where this comes from. I think your suggestion would work well and is a clean solution in keeping with the Unix philosophy...

grexe, Nov 01 '19 22:11