OCRmyPDF assumes really large DPI for native PDF when rasterizing as image
I use OCRmyPDF to process lots of "native" PDFs (by that I mean PDFs generated by Word, etc.).
Due to some constraints, a lot of these PDFs have to be processed with the --force-ocr flag enabled. When a PDF contains a logo or other image, this can cause the rasterized pages to be rendered at an insanely high DPI. Some DIN A4 pages come out at approx. 16k x 22k pixels, which is way too large for Tesseract to handle with decent resource usage.
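For a sense of scale, here is a rough sketch of the arithmetic behind those numbers. It assumes an exact DIN A4 page (210 mm x 297 mm) and takes 16000 x 22000 as a stand-in for "approx. 16k x 22k"; the function names are made up for illustration and are not part of OCRmyPDF.

```python
# DIN A4 is 210 mm x 297 mm; 1 inch is 25.4 mm.
A4_WIDTH_IN = 210 / 25.4
A4_HEIGHT_IN = 297 / 25.4


def effective_dpi(width_px, height_px,
                  width_in=A4_WIDTH_IN, height_in=A4_HEIGHT_IN):
    """Return the (horizontal, vertical) DPI implied by a raster size."""
    return width_px / width_in, height_px / height_in


def pixels_at_dpi(dpi, width_in=A4_WIDTH_IN, height_in=A4_HEIGHT_IN):
    """Return the (width, height) in pixels of the page rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed raster size, standing in for "approx. 16k x 22k pixels".
x_dpi, y_dpi = effective_dpi(16000, 22000)
print(f"implied DPI: {x_dpi:.0f} x {y_dpi:.0f}")
print("A4 at 300 DPI:", pixels_at_dpi(300))
```

Under these assumptions the page is being rendered at roughly 1900 DPI, more than six times the 300 DPI that an A4 page needs to come out at about 2480 x 3508 pixels.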
Before using OCRmyPDF I manually used Tesseract and Ghostscript (or similar) and always rendered documents at 300 DPI, which was enough for Tesseract to return fine results.
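The manual workflow described above could look roughly like the following. This is only a sketch of one possible setup, not the exact commands I used: it assumes the `gs` and `tesseract` executables are on the PATH, and the helper names, the `page-%03d.png` pattern, and the `eng` language are placeholders chosen for illustration.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf, out_pattern, dpi=300):
    """Build a Ghostscript command that renders each page to a PNG at `dpi`."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",             # 24-bit RGB PNG output
        f"-r{dpi}",                    # fixed rendering resolution
        f"-sOutputFile={out_pattern}",
        str(pdf),
    ]


def tesseract_cmd(image, out_base, lang="eng"):
    """Build a Tesseract command that writes `<out_base>.pdf` with a text layer."""
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def run_pipeline(pdf, workdir=".", dpi=300, lang="eng"):
    """Rasterize `pdf` at a fixed DPI, then OCR every rendered page."""
    workdir = Path(workdir)
    pattern = str(workdir / "page-%03d.png")
    subprocess.run(ghostscript_cmd(pdf, pattern, dpi), check=True)
    for png in sorted(workdir.glob("page-*.png")):
        subprocess.run(tesseract_cmd(png, png.with_suffix(""), lang), check=True)
```

Calling `run_pipeline("input.pdf")` would leave one small PDF per page in the working directory. The point is only that the resolution is pinned by `-r300`, regardless of what images are embedded in the source PDF.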
When debugging this problem, I figured that what OCRmyPDF might lack is a "--max-dpi" flag which limits the DPI used when rasterizing PDFs. AFAIK there is no such flag, is there?
You can use the --oversample 300 parameter to rasterize your PDFs at 300 DPI.
Fixed in v15
Any way to set any limits on this?
We have a 5 MiB file that triggers an OOM error with 14 GiB of RAM. The allocations seem to be related to rasterizing and PIL.
Setting --skip-big and --max-image-mpixels only gives me a warning, and then it seems OCRmyPDF just does it anyway, as I don't notice a difference in runtime or memory usage.
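To illustrate why a small PDF can still exhaust memory once it is rasterized, here is a back-of-the-envelope estimate of the size of a single uncompressed page image. It is a sketch under assumptions: the 16000 x 22000 size is borrowed from the original report above rather than measured on this file, and the figures cover one 8-bit-per-channel buffer only; how many such copies exist at once during processing is not something this estimate knows.

```python
def raster_bytes(width_px, height_px, channels=3, bytes_per_channel=1):
    """Uncompressed size in bytes of one raster image held in memory."""
    return width_px * height_px * channels * bytes_per_channel


GIB = 2 ** 30
MIB = 2 ** 20

# Assumed page size taken from the original report (approx. 16k x 22k).
huge_rgb = raster_bytes(16000, 22000)               # RGB, 8 bits per channel
huge_gray = raster_bytes(16000, 22000, channels=1)  # 8-bit grayscale
# The same A4 page at 300 DPI is about 2480 x 3508 pixels.
normal_rgb = raster_bytes(2480, 3508)

print(f"16k x 22k RGB:  {huge_rgb / GIB:.2f} GiB")
print(f"16k x 22k gray: {huge_gray / GIB:.2f} GiB")
print(f"300 DPI A4 RGB: {normal_rgb / MIB:.1f} MiB")
```

At that size a single RGB copy of one page is close to 1 GiB, so a handful of intermediate copies or pages processed in parallel would plausibly account for an OOM, whereas the same page at 300 DPI needs about 25 MiB.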