Correct way to deskew PDF already processed by OCRmyPDF?

Open pimlottc opened this issue 1 year ago • 7 comments

Describe the question What is the correct way to deskew a PDF that has already been OCR'd by OCRmyPDF? OCRmyPDF complains whenever the input PDF already has OCR'd text. --redo-ocr sounds like the most appropriate option, but it is listed as not compatible with --deskew.

Rationale I have an automated workflow to OCR documents after scanning them from my desktop scanner. I often want to deskew documents from my scanner that are a little off, but sometimes deskew messes up a document that was already fine. So I'd rather have all documents OCR'd automatically and then manually deskew a document only if it seems to actually need it.

To Reproduce

  1. Run OCR: ocrmypdf receipt.pdf receipt.pdf
  2. View document and decide it needs to be deskewed
  3. Run again: ocrmypdf --deskew --redo-ocr receipt.pdf receipt.pdf

Expected behavior Output document is deskewed and OCR'd with the same result as if I had included --deskew the first time.

Actual behavior

% ocrmypdf --deskew --redo-ocr receipt.pdf receipt.pdf
--redo-ocr is not currently compatible with --deskew, --clean-final, and --remove-background

System (please complete the following information):

  • OS: macOS
  • Python version: 3.11
  • OCRmyPDF version: 14.0.3
  • Platform: ARM (Apple M1)

Installation brew install ocrmypdf

pimlottc avatar Mar 18 '23 16:03 pimlottc

I also see that --force-ocr is another option that is compatible with --deskew; however, its description is less clear and makes it sound like --force-ocr is potentially lossy and/or destructive, as it describes rasterizing text and vector elements and "rewriting the PDF":

Rasterize any text or vector objects on each page, apply OCR, and save the rastered output (this rewrites the PDF)

Exactly what "rewriting the PDF" means is not clear to me; isn't the PDF /always/ rewritten, since it's adding/updating an OCR layer? More likely it's talking about rewriting portions of the PDF aside from the OCR layer. If that's the case, wouldn't options like --deskew and --clean-final be doing that too? But the descriptions for those don't say anything about that; --force-ocr is the only option that is specifically called out as "rewriting the PDF".

pimlottc avatar Mar 18 '23 16:03 pimlottc

Further adding to my confusion is that running OCR + deskew in two steps (via --force-ocr) results in a slightly larger file than running OCR + deskew in one step:

In 1 step:

% ocrmypdf --deskew receipt.orig.pdf receipt.ocr-deskew.pdf 

In 2 steps:

% ocrmypdf receipt.orig.pdf receipt.ocr.pdf
% ocrmypdf --deskew --force-ocr receipt.ocr.pdf receipt.ocr-deskew-force-ocr.pdf

Results:

-rw-r--r--   1 chris  staff  477262 Mar 18 11:43 receipt.orig.pdf
-rw-r--r--   1 chris  staff  187114 Mar 18 11:44 receipt.ocr-deskew.pdf
-rw-r--r--   1 chris  staff  363849 Mar 18 11:44 receipt.ocr.pdf
-rw-r--r--   1 chris  staff  192836 Mar 18 11:45 receipt.ocr-deskew-force-ocr.pdf

This makes me less sure that it's doing the same thing.

pimlottc avatar Mar 18 '23 16:03 pimlottc

--redo-ocr by design requires the image to remain exactly the same, which conflicts with deskewing or any other operation that alters the page image.

--force-ocr asks ocrmypdf to rasterize an image of the page and mostly discard the original contents. The raster image often ends up being larger than the original. (For example, a page that was 400 dpi black and white on a 200 dpi color background will be upgraded to 400 dpi color.)

--force-ocr --deskew is the best available option to re-process. You can try more aggressive optimization to reduce file size after the fact.
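As a rough illustration of the re-processing step described above, here is a minimal sketch that only assembles the command line; it does not run OCRmyPDF. The helper name build_reprocess_command is hypothetical, and passing --optimize in the same run is just one assumed reading of "more aggressive optimization", not a recommendation from this thread.

```python
from typing import List, Optional


def build_reprocess_command(
    src: str, dst: str, optimize: Optional[int] = None
) -> List[str]:
    """Assemble an ocrmypdf command that deskews an already-OCR'd PDF.

    Hypothetical helper for illustration only; it builds the argument
    list and leaves execution to the caller.
    """
    # --force-ocr is used because --redo-ocr refuses to run with --deskew.
    cmd = ["ocrmypdf", "--deskew", "--force-ocr"]
    if optimize is not None:
        # Assumption: a higher --optimize level is what "more aggressive
        # optimization" refers to. ocrmypdf accepts levels 0 through 3.
        if optimize not in (0, 1, 2, 3):
            raise ValueError("optimize must be 0, 1, 2 or 3")
        cmd += ["--optimize", str(optimize)]
    return cmd + [src, dst]


if __name__ == "__main__":
    print(" ".join(build_reprocess_command("receipt.pdf", "receipt.pdf", optimize=3)))
```

The resulting list can be handed to subprocess.run once the command looks right. Higher optimization levels may themselves be lossy, so it is probably worth comparing the output against the original scan.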

jbarlow83 avatar Mar 18 '23 19:03 jbarlow83

--force-ocr asks ocrmypdf to rasterize an image of the page and mostly discard the original contents. The raster image often ends up being larger than the original. (For example, a page that was 400 dpi black and white on a 200 dpi color background will be upgraded to 400 dpi color.)

Is this any different than what happens when you do ocrmypdf --deskew on a PDF that doesn't have an OCR layer?

pimlottc avatar Mar 19 '23 03:03 pimlottc

ocrmypdf --deskew will not run on a PDF that contains text. If the PDF contains text we exit with an error. ocrmypdf --deskew --skip-text will only deskew pages that contain no text. (Useful when you have, say, a PDF generated from Word with scanned pages attached.) ocrmypdf --deskew --force-ocr will deskew everything.
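The three cases above can be summarized in a small decision function. This is only a model of the behavior as described in this comment, not OCRmyPDF's actual code; the name plan_deskew and the returned labels are made up for illustration.

```python
def plan_deskew(has_text: bool, *, skip_text: bool = False, force_ocr: bool = False) -> str:
    """Model what happens to one page when --deskew is requested.

    Hypothetical sketch of the behavior described in this thread:
    returns "deskew", "skip", or "error".
    """
    if force_ocr:
        # --deskew --force-ocr: every page is deskewed, text or not.
        return "deskew"
    if not has_text:
        # Pages without text are always eligible for deskewing.
        return "deskew"
    if skip_text:
        # --deskew --skip-text: pages that already contain text are left alone.
        return "skip"
    # Plain --deskew on a page that contains text: exit with an error.
    return "error"


if __name__ == "__main__":
    for has_text in (False, True):
        for flags in ({}, {"skip_text": True}, {"force_ocr": True}):
            print(has_text, flags, plan_deskew(has_text, **flags))
```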

jbarlow83 avatar Mar 19 '23 03:03 jbarlow83

Okay, I understand that, my question is whether --deskew on a PDF without text also ends up doing the same rasterization as --force-ocr (which as you say could change the DPI and increase the filesize).

Basically, I want to get as close to the same result when deskewing a PDF that was already OCR'd by OCRmyPDF as I would by running --deskew the first time. Ideally the original image data is altered by the minimum amount necessary.

Obviously just running --deskew must be changing the image data, but the description of --force-ocr makes it sound like it might be more extreme.

pimlottc avatar Mar 19 '23 03:03 pimlottc

Okay, I spent some time reading the code, please correct me if I'm wrong here.

Based on the definition of options.lossless_reconstruction, full rasterization (with all the effects you described earlier) occurs when ANY of the following options are specified:

  • --force-ocr
  • --deskew
  • --clean-final
  • --remove-background
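To make that reading concrete, here is a small sketch of the rule as I understand it. It is a paraphrase and not the project's source; the function name and the attribute names are assumptions based on the command-line flags.

```python
from types import SimpleNamespace

# Options that, per the reading above, force full rasterization of the page.
RASTERIZING_OPTIONS = ("force_ocr", "deskew", "clean_final", "remove_background")


def is_lossless_reconstruction(options) -> bool:
    """Return True only when none of the rasterizing options is set.

    Hypothetical paraphrase of the rule discussed in this thread.
    """
    return not any(getattr(options, name, False) for name in RASTERIZING_OPTIONS)


if __name__ == "__main__":
    plain = SimpleNamespace(deskew=False, force_ocr=False)
    deskewed = SimpleNamespace(deskew=True, force_ocr=False)
    print(is_lossless_reconstruction(plain))     # no rasterizing option set
    print(is_lossless_reconstruction(deskewed))  # --deskew alone is enough
```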

So really, the warning text in --help for --force-ocr applies to any of these arguments. Perhaps they should all have a marker like *LOSSY* or be in their own section, e.g.

Lossy options:
  Using these options will cause the visual data of the PDF to be rewritten.
  Any text or vector objects will be rasterized [..]

  --remove-background   Attempt to remove background from gray or color pages,
                        setting it to white.
  -d, --deskew          Deskew each page before performing OCR.
  -i, --clean-final     Same as --clean, but also incorporate the cleaned image
                        in the final PDF. Might remove desired content.
  -f, --force-ocr       Same as --redo-ocr, but also incorporate the rasterized
                        image in the final PDF.

This would be more accurate, since --force-ocr is not the only argument that is "dangerous" in this way, which is what the current help text suggests.

pimlottc avatar Mar 21 '23 04:03 pimlottc