OCRmyPDF
Correct way to deskew PDF already processed by OCRmyPDF?
Describe the question
What is the correct way to deskew a PDF that has already been OCR'd by OCRmyPDF? OCRmyPDF complains whenever the input PDF already has OCR'd text. --redo-ocr sounds like the most appropriate option, but it is listed as not compatible with --deskew.
Rationale
I have an automated workflow to OCR documents after scanning them from my desktop scanner. I often want to deskew documents from my scanner that are a little off, but sometimes deskew messes up a document that was already fine. So I'd rather have all documents OCR'd automatically and then manually deskew a document only if it actually needs it.
To Reproduce
- Run OCR:
ocrmypdf receipt.pdf receipt.pdf
- View document and decide it needs to be deskewed
- Run again:
ocrmypdf --deskew --redo-ocr receipt.pdf receipt.pdf
Expected behavior
Output document is deskewed and OCR'd with the same result as if I had included --deskew the first time.
Actual behavior
% ocrmypdf --deskew --redo-ocr receipt.pdf receipt.pdf
--redo-ocr is not currently compatible with --deskew, --clean-final, and --remove-background
System (please complete the following information):
- OS: macOS
- Python version: 3.11
- OCRmyPDF version: 14.0.3
- Platform: ARM (Apple M1)
Installation
brew install ocrmypdf
I also see that --force-ocr is another option, and one that can be combined with --deskew; however, its description is less clear and makes it sound like --force-ocr is potentially lossy and/or destructive, as it describes rasterizing text and vector elements and "rewriting the PDF":
Rasterize any text or vector objects on each page, apply OCR, and save the rastered output (this rewrites the PDF)
Exactly what "rewriting the PDF" means is not clear to me; isn't the PDF /always/ rewritten, since it's adding/updating an OCR layer? More likely it's talking about rewriting portions of the PDF aside from the OCR layer. If that's the case, wouldn't options like --deskew and --clean-final be doing that too? But the descriptions for those don't say anything about that; --force-ocr is the only option that is specifically called out as "rewriting the PDF".
Further adding to my confusion is that running OCR + deskew in two steps (via --force-ocr) results in a slightly larger file than running OCR + deskew in one step:
In 1 step:
% ocrmypdf --deskew receipt.orig.pdf receipt.ocr-deskew.pdf
In 2 steps:
% ocrmypdf receipt.orig.pdf receipt.ocr.pdf
% ocrmypdf --deskew --force-ocr receipt.ocr.pdf receipt.ocr-deskew-force-ocr.pdf
Results:
-rw-r--r-- 1 chris staff 477262 Mar 18 11:43 receipt.orig.pdf
-rw-r--r-- 1 chris staff 187114 Mar 18 11:44 receipt.ocr-deskew.pdf
-rw-r--r-- 1 chris staff 363849 Mar 18 11:44 receipt.ocr.pdf
-rw-r--r-- 1 chris staff 192836 Mar 18 11:45 receipt.ocr-deskew-force-ocr.pdf
This makes me less sure that it's doing the same thing.
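To put "slightly larger" in numbers, here is a minimal sketch that compares the two deskewed outputs using the byte counts from the listing above. The helper name `percent_larger` is made up for this illustration and is not part of OCRmyPDF.

```python
def percent_larger(baseline: int, other: int) -> float:
    """Return how much larger `other` is than `baseline`, in percent."""
    return (other - baseline) / baseline * 100.0


# Byte counts copied from the `ls -l` listing above.
one_step = 187114   # receipt.ocr-deskew.pdf
two_step = 192836   # receipt.ocr-deskew-force-ocr.pdf

print(f"two-step output is {percent_larger(one_step, two_step):.1f}% larger")
```

The difference comes out at roughly 3%, which is small but is enough to show the two outputs are not byte-identical.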
--redo-ocr by design requires the image to remain exactly the same, which conflicts with deskewing or any other operation that alters the page image.
--force-ocr asks ocrmypdf to rasterize an image of the page and mostly discard the original contents. The raster image often ends up being larger than the original. (For example, a page that was 400 dpi black and white on a 200 dpi color background will be upgraded to 400 dpi color.)
--force-ocr --deskew is the best available option to re-process. You can try more aggressive optimization to reduce file size after the fact.
--force-ocr asks ocrmypdf to rasterize an image of the page and mostly discard the original contents. The raster image often ends up being larger than the original. (For example, a page that was 400 dpi black and white on a 200 dpi color background will be upgraded to 400 dpi color.)
Is this any different from what happens when you run ocrmypdf --deskew on a PDF that doesn't have an OCR layer?
ocrmypdf --deskew will not run on a PDF that contains text. If the PDF contains text we exit with an error.
ocrmypdf --deskew --skip-text will only deskew pages that contain no text. (Useful when you have, say, a PDF generated from Word with scanned pages attached.)
ocrmypdf --deskew --force-ocr will deskew everything.
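The three invocations described above can be sketched as a small Python helper that builds the argument list. This is only an illustration: the function name `build_deskew_command` and the `mode` labels are hypothetical. The `--optimize` level is an OCRmyPDF option that the thread does not name; it is included here only as an assumption about what "more aggressive optimization" could look like.

```python
from typing import List, Optional


def build_deskew_command(
    input_pdf: str,
    output_pdf: str,
    mode: str = "no-text",
    optimize: Optional[int] = None,
) -> List[str]:
    """Return an ocrmypdf argument list for deskewing.

    `mode` is a label invented for this sketch:
      "no-text"   - input has no text layer: plain --deskew
      "skip-text" - deskew only the pages that contain no text
      "force"     - rasterize and deskew every page (--force-ocr)
    """
    extra = {
        "no-text": [],
        "skip-text": ["--skip-text"],
        "force": ["--force-ocr"],
    }
    if mode not in extra:
        raise ValueError(f"unknown mode: {mode!r}")

    cmd = ["ocrmypdf", "--deskew", *extra[mode]]
    if optimize is not None:
        # Assumption: a higher --optimize level is one way to shrink the output.
        cmd += ["--optimize", str(optimize)]
    return cmd + [input_pdf, output_pdf]


# Re-process a PDF that already has an OCR layer.
print(build_deskew_command("receipt.ocr.pdf", "receipt.deskewed.pdf", mode="force"))
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`. Writing to a separate output file rather than overwriting the input keeps the original available if the deskew makes things worse.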
Okay, I understand that, my question is whether --deskew on a PDF without text also ends up doing the same rasterization as --force-ocr (which as you say could change the DPI and increase the file size).
Basically, I want to get as close as possible to the same result when deskewing a PDF that was already OCR'd by OCRmyPDF as I would get by running --deskew the first time. Ideally the original image data is altered only as much as necessary.
Obviously just running --deskew must be changing the image data, but the description of --force-ocr makes it sound like it might be more extreme.
Okay, I spent some time reading the code, please correct me if I'm wrong here.
Based on the definition of options.lossless_reconstruction, full rasterization (with all the effects you described earlier) occurs when ANY of the following options are specified:
- --force-ocr
- --deskew
- --clean-final
- --remove-background
So really, the warning text in --help for --force-ocr applies to any of these arguments. Perhaps they should all have a marker like *LOSSY* or be in their own section, e.g.
Lossy options:
Using these options will cause the visual data of the PDF to be rewritten.
Any text or vector objects will be rasterized [..]
--remove-background Attempt to remove background from gray or color pages,
setting it to white.
-d, --deskew Deskew each page before performing OCR.
-i, --clean-final Same as --clean, but also incorporate the cleaned image
in the final PDF. Might remove desired content.
-f, --force-ocr Same as --redo-ocr, but also incorporate the rasterized
image in the final PDF.
This would be more accurate, since --force-ocr is not the only argument that is "dangerous" in this way, which is what the current help text suggests.
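For reference, the rule as I read it (any one of those four options switches off lossless reconstruction) can be restated as a short sketch. This is my paraphrase, not a copy of the OCRmyPDF source; the `Options` dataclass and the function here are stand-ins written for illustration, and the real definition may differ.

```python
from dataclasses import dataclass


@dataclass
class Options:
    """Stand-in for the parsed command-line options (hypothetical)."""
    force_ocr: bool = False
    deskew: bool = False
    clean_final: bool = False
    remove_background: bool = False


def lossless_reconstruction(options: Options) -> bool:
    """True only if none of the page-altering options is set."""
    return not any(
        [
            options.force_ocr,
            options.deskew,
            options.clean_final,
            options.remove_background,
        ]
    )


print(lossless_reconstruction(Options()))             # no lossy options set
print(lossless_reconstruction(Options(deskew=True)))  # --deskew alone
```

If this reading is right, --deskew on its own already takes the same rasterizing path as --force-ocr, which is why grouping the four options together in --help would be clearer.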