
Add command to skip all processing related to OCR

Open wpzdm opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe.
The time spent in OCR stage is quite long even with --tesseract-timeout=0, at least half of the time of a complete OCR.

Describe the solution you'd like
It seems we need a --no-ocr option that skips all processing related to OCR, including generating the OCR image. --tesseract-timeout=0 is a compromise for as long as that option does not exist. Of course, it is a bit weird to add --no-ocr to a tool that is made for OCR. Still, I find OCRmyPDF handy for post-processing, such as image optimization and saving to PDF/A.

Additional context
See #647

wpzdm avatar Oct 09 '20 01:10 wpzdm

I am looking for a Linux CLI alternative to mobile scanner apps à la CamScanner, and I, too, would like to see a --no-ocr option.

NightMachinery avatar Oct 16 '20 12:10 NightMachinery

I am a new user and I do not quite understand this issue. When you write:

"The time spent in OCR stage is quite long even with --tesseract-timeout=0, at least half of the time of a complete OCR."

What do you mean by "a complete OCR"? I thought you did not want to do OCR in the first place.

Isn't "--tesseract-timeout=0" equivalent to a hypothetical "--no-ocr" option? This is from the "cookbook" documentation page:

-----8<-----8<-----8<-----
Don’t actually OCR my PDF

If you set --tesseract-timeout 0 OCRmyPDF will apply its image processing without performing OCR, if all you want is to apply image processing or PDF/A conversion.
-----8<-----8<-----8<-----

Is that not exactly what you want with "--no-ocr"?
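For readers who want to try the workaround discussed here, the following is a minimal sketch, not an official recipe. It assumes the `ocrmypdf` executable is installed and on PATH; the helper names `build_no_ocr_command` and `run_without_ocr` and the file names are hypothetical. It only wraps the existing `--tesseract-timeout 0` option together with `--optimize` and `--output-type pdfa`; it does not make the run any faster than the command line would.

```python
import shutil
import subprocess
from pathlib import Path


def build_no_ocr_command(input_pdf, output_pdf, optimize=1):
    """Build an ocrmypdf command line that lets Tesseract time out immediately.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    return [
        "ocrmypdf",
        "--tesseract-timeout", "0",   # the workaround discussed in this thread
        "--optimize", str(optimize),  # image optimization level (0 to 3)
        "--output-type", "pdfa",      # request PDF/A output
        str(input_pdf),
        str(output_pdf),
    ]


def run_without_ocr(input_pdf, output_pdf, optimize=1):
    """Run the command above, assuming ocrmypdf is installed on this machine."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf executable not found on PATH")
    cmd = build_no_ocr_command(input_pdf, output_pdf, optimize)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works anywhere.
    print(" ".join(build_no_ocr_command(Path("scan.pdf"), Path("scan_pdfa.pdf"))))
```

Calling `run_without_ocr("scan.pdf", "scan_pdfa.pdf")` would execute the same command; whether that is quick enough is exactly what this issue is about.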

rdiez avatar Feb 06 '22 12:02 rdiez

In #647, the opener of this issue complained that even with --tesseract-timeout=0, ocrmypdf still seemed slow. The reason is partly that the Tesseract timeout is a bit of a hack: we still do everything as if OCR were happening, and we just let it time out immediately. Basically, I realized it was a "free" feature that dropped out of the existing design without any extra work on my part.

Understandably, users would like the no-OCR case to be better packaged and more efficient, but that does need extra work on my part, hence the delay.

jbarlow83 avatar Feb 06 '22 23:02 jbarlow83