Page rotation is not performed when the PDF file already contains text

Open sstefanov opened this issue 3 years ago • 2 comments

Describe the bug When ocrmypdf is run with the -r and -s options on a rotated PDF that already contains text, the pages in the output file are still rotated.

To Reproduce ocrmypdf -r -s a4rotated.pdf a4rotated_ocr.pdf

Expected behavior Output pages should be correctly rotated whenever the -r option is used.

System (please complete the following information):

  • OS: Debian testing
  • Python version: 3.8.6
  • OCRmyPDF version: 11.3.0

Additional context When ocrmypdf is run with --redo-ocr, the page is rotated properly, but the original text is still not rotated. With --force-ocr, it works as needed.
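For anyone who wants to compare the three modes side by side, here is a minimal sketch that builds one command line per mode, each with page rotation enabled. It uses the long forms of the options mentioned above (`--rotate-pages` for -r, `--skip-text` for -s). The helper name `build_command` and the output file names are assumptions made for illustration; they do not come from the report.

```python
import shlex

# Text-handling modes compared in this report (long option names).
MODES = {
    "skip": "--skip-text",   # same as -s
    "redo": "--redo-ocr",
    "force": "--force-ocr",
}


def build_command(mode, src="a4rotated.pdf"):
    """Return the argv list for one ocrmypdf run with --rotate-pages (-r)."""
    # Hypothetical naming scheme: a4rotated.pdf -> a4rotated_skip.pdf, etc.
    stem = src[:-4] if src.endswith(".pdf") else src
    dst = f"{stem}_{mode}.pdf"
    return ["ocrmypdf", "--rotate-pages", MODES[mode], src, dst]


if __name__ == "__main__":
    # Print the commands only; pass each list to subprocess.run() to execute.
    for mode in MODES:
        print(shlex.join(build_command(mode)))
```

Running the printed commands against the attached file should make it easy to check each output for page orientation and for the position of the text layer.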

a4rotated.pdf

sstefanov, Oct 27 '20 12:10

I think the existing behavior is the correct behavior. --skip-text means processing on pages that have text is skipped. The intended use case is a PDF that contains a mixture of "born digital" and scanned images, and we want to ensure the entire PDF is searchable. Generally, "born digital" pages are correctly oriented.

Why do you think it should work in the way you propose?

jbarlow83, Oct 28 '20 00:10

Sometimes "born digital" pages are not in the correct orientation (I have such documents). In these cases I see two problems:

  1. -r is ignored in combination with --skip-text. I think the expected behaviour of -r is to produce a result with correctly oriented pages, independent of text processing.
  2. --redo-ocr creates a complete mess: the image is rotated, but the text is not.

sstefanov, Oct 28 '20 04:10