--rotate-pages is being done but its results are ignored when --skip-text is enabled

Open tamarit opened this issue 5 years ago • 16 comments

Hi, we have observed that in PDFs with text, e.g. a PDF where all the pages except the last one contain text, the rotation analysis is performed but its result is ignored, i.e. pages with text that should be rotated are left unrotated. We expected one of two alternative behaviors (but not this one):

  1. When the page has text, no rotation process is performed, i.e. even if the page should be rotated, it is left as it was.
  2. If a page with text should be rotated, then rotate it.

We do not understand the current behavior, so if anyone could explain it to us, that would be fantastic. Thanks in advance, and congratulations on your nice tool!

tamarit avatar Dec 13 '18 14:12 tamarit

Could you provide an example pdf?

jbarlow83 avatar Dec 13 '18 17:12 jbarlow83

Hi,

We are using a PDF where both pages are exactly the same with the exception that the first one has text while the second one has none. The command we run is the following:

$ ocrmypdf --language spa --clean --deskew --rotate-pages --rotate-pages-threshold 3 --skip-text example.pdf outexample.pdf

and the shell outputs the following:

   INFO -    1: page already has text! – skipping all processing on this page
   INFO -    2: page is facing ⇦, confidence 3.77 - will rotate
   INFO -    1: page is facing ⇦, confidence 25.70 - will rotate
   INFO - Optimize ratio: 1.00 savings: 0.3%
   INFO - Output file is a PDF/A-2B (as expected)

As we can see in the output PDF, although both pages need rotation, only the one that has no text has been rotated. This would make sense if the --skip-text flag simply ignored the first page. However, since the rotation analysis is being run on this page, we do not understand why it is left unrotated.

Thanks in advance!

tamarit avatar Dec 14 '18 08:12 tamarit

@tamarit curious to know if you figured out a resolution to this problem and how exactly? Thanks.

ajab21 avatar Mar 06 '19 22:03 ajab21

@ajab21 No. We still have the issue as fresh as when it was reported. :(

tamarit avatar Mar 07 '19 09:03 tamarit

@jbarlow83 I'm assuming this happens because --skip-text forces all processing to be skipped on the page that has existing text, including the page rotation that's needed. Any way around this? Or, is an enhancement needed to the processing logic for this scenario when both --skip-text and --rotate-pages need to be used in the command?

ajab21 avatar Mar 08 '19 00:03 ajab21

Has anyone dug into the code around this yet?

jrk2401 avatar Mar 24 '19 14:03 jrk2401

@ajab21 As you suspected, and if I'm reading it right, the current code skips all processing on the page once it finds existing text. @jbarlow83 is this by design? As you're looking for the most conservative handling of the original PDF?

jrk2401 avatar Mar 24 '19 14:03 jrk2401

The current behavior wasn't quite planned. Yes, it does calculate the orientation for skipped pages and says they will be fixed, but then it does not fix them.

For skipped pages, no files/symlinks are generated that match the weave_layers regex (which selects ocr.oriented.pdf but not skip.oriented.pdf). This means the loop that applies rotation fixes will not iterate over pages that were skipped.
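
The effect described above can be sketched in a few lines of Python. This is only an illustration: the pattern and the numbered file names are hypothetical stand-ins, since the thread only states that the real regex selects ocr.oriented.pdf but not skip.oriented.pdf.

```python
import re

# Hypothetical pattern, assumed for illustration. The actual weave_layers
# regex in OCRmyPDF may look different; the thread only says it matches
# "ocr.oriented.pdf" and not "skip.oriented.pdf".
WEAVE_PATTERN = re.compile(r"^\d+\.ocr\.oriented\.pdf$")


def pages_seen_by_rotation_loop(filenames):
    """Return the page numbers whose work files match the pattern."""
    pages = []
    for name in filenames:
        if WEAVE_PATTERN.match(name):
            # The leading digits are assumed to be the page number.
            pages.append(int(name.split(".", 1)[0]))
    return sorted(pages)


# Page 1 already had text and was skipped; page 2 went through OCR.
work_files = ["000001.skip.oriented.pdf", "000002.ocr.oriented.pdf"]
print(pages_seen_by_rotation_loop(work_files))
```

Under these assumptions only page 2 is returned, which matches the reported symptom: the orientation of page 1 is computed and logged, but no later step ever picks the page up to apply the rotation.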

Unfortunately I can't address this for a few weeks.

I haven't decided on what the correct behavior should be. I'm considering --rotate-pages --skip-text means no evaluation or rotation of pages that have text, and --rotate-pages --redo-ocr means consider re-rotation of all pages.
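
One way to read the behavior being considered here, written as a small Python predicate. This is a sketch of the proposal only, not OCRmyPDF code; the function name and parameters are hypothetical, and the handling of combinations the comment does not mention (for example neither flag set) is an assumption.

```python
def should_consider_rotation(rotate_pages, page_has_text,
                             skip_text=False, redo_ocr=False):
    """Decide whether a page's orientation is evaluated at all.

    Hypothetical helper that mirrors the proposal in this comment.
    """
    if not rotate_pages:
        # Without --rotate-pages nothing is evaluated.
        return False
    if redo_ocr:
        # Proposed: --redo-ocr considers re-rotation of all pages.
        return True
    if skip_text and page_has_text:
        # Proposed: --skip-text means no evaluation or rotation of text pages.
        return False
    # Assumption: every other page is evaluated as usual.
    return True
```

With this reading, the page that "already has text" in the report above would never be analyzed under --skip-text, so the misleading "will rotate" log line would not appear for it.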

jbarlow83 avatar Mar 24 '19 21:03 jbarlow83

Hi @jbarlow83, please keep us posted on what you decide. An improvement here, with the code better reflecting the intent of the configuration, would be a great plus. Thanks!

ajab21 avatar Jul 22 '19 20:07 ajab21

It would be great to have the option to run the page rotation routine outside of OCR. In my case I OCR'd ~300 files using 12 cores and it took 2 days. It would be nice not to have to run the whole process again. Maybe with the option to import ocrmypdf as a module now, I can figure that out.

Similarly, I'm looking at importing optimize.py as a module. So far the dependencies fail for me, but it should be possible, yes?

jrk2401 avatar Jul 29 '19 17:07 jrk2401

@ajab21 No change in behavior for 9.0.0, which I really wanted to get out for other reasons.

jbarlow83 avatar Jul 29 '19 19:07 jbarlow83

@jrk2401 I think you could monkeypatch it but this isn't quite exposed behavior. I'm still working on plugins and hooks at major decision points, but I cut that feature from this release.

You should be able to import optimize or even run it as python3 -m ocrmypdf.optimize. What dependencies fail?

jbarlow83 avatar Jul 29 '19 19:07 jbarlow83

Leptonica for one, but my environment variables might be to blame, as Leptonica is installed for sure. It's quite possible I broke my working installation of OCRmyPDF just messing about. Will try a clean installation of the latest version and see how that goes.

jrk2401 avatar Jul 29 '19 19:07 jrk2401

@jbarlow83 actually python3 -m ocrmypdf.optimize works a treat, thanks for the hint. I'm going to upgrade anyway. I figure I can fix the rotated pages with pikepdf once I have a list of them.
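
A minimal sketch of that pikepdf approach, assuming a recent pikepdf release that provides Page.rotate(angle, relative). The helper names, the file names in the usage comment, and the shape of the corrections mapping are made up for illustration; how the list of misrotated pages is obtained is left open, as in the comment above.

```python
def to_zero_based(corrections):
    """Validate corrections and convert 1-based page numbers to indexes.

    `corrections` maps page numbers to clockwise degrees (hypothetical format).
    """
    result = {}
    for page_number, degrees in corrections.items():
        if page_number < 1:
            raise ValueError("page numbers start at 1")
        if degrees % 90 != 0:
            raise ValueError("PDF page rotation must be a multiple of 90")
        result[page_number - 1] = degrees % 360
    return result


def rotate_listed_pages(input_pdf, output_pdf, corrections):
    """Rotate only the listed pages and leave everything else untouched."""
    import pikepdf  # third-party dependency, imported lazily

    with pikepdf.open(input_pdf) as pdf:
        for index, degrees in to_zero_based(corrections).items():
            # relative=True adds to whatever rotation the page already has.
            pdf.pages[index].rotate(degrees, relative=True)
        pdf.save(output_pdf)


# Example call (file names are placeholders):
# rotate_listed_pages("in.pdf", "out.pdf", {1: 90, 7: 180})
```

Because this only changes the rotation of each listed page, the existing text layer is kept and no OCR has to be rerun.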

jrk2401 avatar Jul 29 '19 23:07 jrk2401

Has there been any progress in allowing for rotation without actually rewriting the OCR layer? One of my use cases is just for detecting and then rotating pdf pages to their correct orientation. But this seems to only work if I either generate an OCR layer (if none exists), or overwrite with --force-ocr (if already present).

Is there an alternate way of achieving this? OCR takes a lot of time and CPU, and sometimes I just want to quickly correct the pdf rather than wait for the full process (which I could schedule and batch later)

SterlingHooten avatar Sep 21 '22 09:09 SterlingHooten

You could use a plugin that allows rotation detection to occur as normal, but suppresses regular OCR. That wasn't available when this issue was live. It would pretty much just be "subclassing" the existing OCR plugin. That would make it more doable to have an "autorotatemypdf".

jbarlow83 avatar Sep 21 '22 10:09 jbarlow83