OCRmyPDF
Check if OCR images would be >2^31 bytes
Leptonica cannot handle images larger than 2^31 bytes in its internal buffer, and it needs 4 bytes per pixel. That puts the limit at 2^29 pixels, or about 23000x23000.
We should avoid situations that would produce an image of this size. If oversampling is enabled, we can back it off...
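A minimal sketch of the check described above, in plain Python. It is not OCRmyPDF's actual implementation: the function names are hypothetical, and the 2^31 byte limit and 4 bytes per pixel are taken from the error message and the description in this issue.

```python
import math

# Assumptions taken from this issue: Leptonica rejects buffers of
# 2^31 bytes or more, and the OCR image uses 4 bytes per pixel (d = 32).
LEPTONICA_MAX_BYTES = 2**31
BYTES_PER_PIXEL = 4


def exceeds_limit(width: int, height: int) -> bool:
    """Return True if a width x height image at 32 bpp would be rejected."""
    return width * height * BYTES_PER_PIXEL >= LEPTONICA_MAX_BYTES


def backed_off_size(width: int, height: int) -> tuple[int, int]:
    """Scale both dimensions down uniformly until the image fits.

    Hypothetical helper: returns the input unchanged if it already fits.
    """
    if not exceeds_limit(width, height):
        return width, height
    max_pixels = (LEPTONICA_MAX_BYTES - 1) // BYTES_PER_PIXEL
    scale = math.sqrt(max_pixels / (width * height))
    # Rounding down keeps the product at or below max_pixels.
    new_w, new_h = math.floor(width * scale), math.floor(height * scale)
    # Guard against floating-point error at the boundary.
    while exceeds_limit(new_w, new_h):
        new_w -= 1
    return new_w, new_h


if __name__ == "__main__":
    # Dimensions from the error log in this issue.
    print(exceeds_limit(29529, 19225))
    print(backed_off_size(29529, 19225))
```

The same scale factor could be applied to the rasterization DPI instead of the pixel dimensions, since pixel size is proportional to DPI for a fixed page size.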
Error in pixCreateHeader: requested w = 29529, h = 19225, d = 32
ERROR - 213: [tesseract] Error in pixCreateHeader: requested bytes >= 2^31
Is there a setting/parameter we can specify in the API if we're hitting this?
No, it's a limit in Leptonica and Tesseract. I believe Leptonica needs to be recompiled because it's a #define, and Tesseract may need its own adjustments.
Although you could check if there's a vector image or some high dpi resource that's causing an intermediate image to be created that is very large and see if you can turn that down. See the --help for knobs to twiddle.
Haven't had any luck, but figured I'd reply for posterity.
I looked through all the available input parameters here, and with image_dpi I was still hitting the same issue.
Try oversample to see if you're doing something that is causing oversampling. The output of -v1 may help too.
Also for posterity, the deal-breaker in Leptonica is here. It's not a #define, it's a literal: https://github.com/DanBloomberg/leptonica/blob/dcb0096b1e8cfe431f63923cf1a9a674291e36f5/src/pix1.c#L532
I think it's a decent guess that Leptonica has 32-bit assumptions elsewhere in its code too and that check is just the main guard. Realistically we need to downsample or partition the image.
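To illustrate the partitioning option, here is a rough sketch that splits an oversized image into horizontal strips, each small enough to pass Leptonica's check. The function name and the strip approach are assumptions for illustration, not what OCRmyPDF does; a real implementation would also need overlapping strips or line-aware cuts so that text is not sliced in half.

```python
# Assumptions from this issue: buffers must stay under 2^31 bytes,
# at 4 bytes per pixel.
LEPTONICA_MAX_BYTES = 2**31
BYTES_PER_PIXEL = 4


def strip_bounds(width: int, height: int) -> list[tuple[int, int]]:
    """Return (top, bottom) row ranges that cover the full image height.

    Each strip is full-width and holds as many rows as fit under the
    byte limit. Hypothetical helper; bottom is exclusive.
    """
    row_bytes = width * BYTES_PER_PIXEL
    max_rows = (LEPTONICA_MAX_BYTES - 1) // row_bytes
    if max_rows < 1:
        raise ValueError("a single row already exceeds the limit")
    return [
        (top, min(top + max_rows, height))
        for top in range(0, height, max_rows)
    ]


if __name__ == "__main__":
    # Dimensions from the error log in this issue.
    for top, bottom in strip_bounds(29529, 19225):
        print(top, bottom, (bottom - top) * 29529 * BYTES_PER_PIXEL)
```

Downsampling is simpler and keeps the page in one piece, while partitioning preserves resolution at the cost of having to merge OCR results across strips.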
Added as of v14.1