
OCRmyPDF assumes really large DPI for native PDF when rasterizing as image

Open fabiante opened this issue 3 years ago • 1 comments

I use OCRmyPDF for processing lots of "native" PDFs (by that I mean PDFs generated by Word, etc.).

Due to some constraints, a lot of these PDFs have to be processed with the --force-ocr flag enabled. When a PDF contains a logo or other image, this can cause the pages to be rasterized at an insanely high DPI. Some DIN A4 pages come out at approx. 16k x 22k pixels, which is way too large for Tesseract to handle if I want decent resource usage.
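To put the numbers above in perspective, here is a rough sketch of the arithmetic. It only assumes the standard DIN A4 size of 210 mm x 297 mm; the helper names `implied_dpi` and `pixels_at_dpi` are made up for illustration and are not part of OCRmyPDF.

```python
# DIN A4 page size converted to inches (25.4 mm per inch).
A4_WIDTH_IN = 210 / 25.4
A4_HEIGHT_IN = 297 / 25.4


def implied_dpi(width_px, height_px, width_in=A4_WIDTH_IN, height_in=A4_HEIGHT_IN):
    """Return the (horizontal, vertical) DPI implied by a raster of the given size."""
    return width_px / width_in, height_px / height_in


def pixels_at_dpi(dpi, width_in=A4_WIDTH_IN, height_in=A4_HEIGHT_IN):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


# The "approx. 16k x 22k" figure from the report, taken as round numbers.
dpi_x, dpi_y = implied_dpi(16_000, 22_000)
print(f"implied DPI: {dpi_x:.0f} x {dpi_y:.0f}")
print("A4 at 300 DPI:", pixels_at_dpi(300))
```

Under these assumptions a 16k x 22k A4 page corresponds to somewhere around 1900 DPI, compared with about 2480 x 3508 pixels at 300 DPI.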

Before using OCRmyPDF I manually used Tesseract and Ghostscript (or similar) and always rendered documents at 300 DPI, which was enough for Tesseract to return fine results.
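For reference, a manual pipeline like the one described could look roughly like the sketch below. This is an assumption about the general shape of such a setup, not the exact commands used: the device (`png16m`), the output file pattern and the helper names are illustrative, and it expects the `gs` and `tesseract` executables to be on the PATH.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each page to a PNG at a fixed DPI."""
    out_pattern = str(Path(out_dir) / "page-%03d.png")  # hypothetical naming scheme
    return [
        "gs",
        "-dNOPAUSE",
        "-dBATCH",
        "-dSAFER",
        "-sDEVICE=png16m",  # 24-bit RGB PNG output
        f"-r{dpi}",         # fixed rendering resolution
        f"-sOutputFile={out_pattern}",
        str(pdf_path),
    ]


def tesseract_command(image_path, out_base, dpi=300):
    """Build a Tesseract command that OCRs one page image into a searchable PDF."""
    return ["tesseract", str(image_path), str(out_base), "--dpi", str(dpi), "pdf"]


def render_and_ocr(pdf_path, out_dir, dpi=300):
    """Render all pages at `dpi`, then OCR each rendered page."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_command(pdf_path, out_dir, dpi), check=True)
    for image in sorted(out_dir.glob("page-*.png")):
        subprocess.run(tesseract_command(image, image.with_suffix(""), dpi), check=True)
```

Calling `render_and_ocr("input.pdf", "out")` would write one PNG and one OCRed PDF per page into `out`; the point is only that the resolution is pinned by `-r300` regardless of what images the PDF embeds.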

When debugging this problem, I figured that what OCRmyPDF might lack is a "--max-dpi" flag which limits the DPI used when rasterizing PDFs. AFAIK there is no such flag, is there?

fabiante avatar Aug 08 '22 08:08 fabiante

You can use the --oversample 300 parameter to rasterize your PDFs at 300 DPI.

hampoelz avatar Aug 28 '22 18:08 hampoelz

Fixed in v15

jbarlow83 avatar Sep 26 '23 19:09 jbarlow83

Any way to set any limits on this?

We have a 5 MiB file that triggers an OOM error with 14 GiB of RAM. The allocations seem to be related to rasterizing and PIL.

Setting --skip-big and --max-image-mpixels only gives me a warning, and then it seems ocrmypdf just does it anyway, as I don't notice a difference in runtime or memory usage.
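As a back-of-the-envelope check on why a small file can exhaust memory, the sketch below estimates the uncompressed size of a single rasterized page. It uses the 16k x 22k figure from the original report as a stand-in, since the dimensions of the 5 MiB file are not known, and it assumes 3 bytes per pixel (8-bit RGB). The function names are made up for illustration.

```python
def raster_bytes(width_px, height_px, bytes_per_pixel=3):
    """Uncompressed size in bytes of one raster image (default: 8-bit RGB)."""
    return width_px * height_px * bytes_per_pixel


def megapixels(width_px, height_px):
    """Image size in megapixels (millions of pixels)."""
    return width_px * height_px / 1_000_000


width, height = 16_000, 22_000  # stand-in dimensions, see lead-in
size = raster_bytes(width, height)
print(f"{megapixels(width, height):.0f} megapixels")
print(f"{size / 2**30:.2f} GiB uncompressed per RGB copy")
```

That is roughly 1 GiB for a single copy of a single page. Real usage would likely be higher, since image pipelines tend to hold several copies or intermediate buffers at once and pages may be processed in parallel.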

andersfylling avatar Jan 19 '24 19:01 andersfylling