OCRmyPDF Check if OCR images would be >2^31 bytes

Open jbarlow83 opened this issue 4 years ago • 4 comments

Leptonica cannot handle images whose internal buffer would be 2^31 bytes or more, and it needs 4 bytes per pixel. That works out to a limit of 2^29 pixels, or about 23000x23000.

We should avoid situations that would produce an image of this size. If oversampling is enabled, we can back it off...

Error in pixCreateHeader: requested w = 29529, h = 19225, d = 32
ERROR -  213: [tesseract] Error in pixCreateHeader: requested bytes >= 2^31
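
The arithmetic behind that error can be sketched as follows. This is only an illustration, not code from OCRmyPDF or Leptonica: the names `pix_bytes` and `exceeds_limit` are hypothetical, and the sketch assumes each image row is padded to a whole number of 32-bit words.

```python
# Limit quoted in the error message above ("requested bytes >= 2^31").
LEPTONICA_MAX_BYTES = 2**31


def pix_bytes(width: int, height: int, depth: int = 32) -> int:
    """Estimate the buffer size in bytes for a width x height image.

    Assumption: rows are padded to whole 32-bit words.
    """
    words_per_line = (width * depth + 31) // 32
    return 4 * words_per_line * height


def exceeds_limit(width: int, height: int, depth: int = 32) -> bool:
    """Return True if the estimated buffer would be 2^31 bytes or more."""
    return pix_bytes(width, height, depth) >= LEPTONICA_MAX_BYTES


if __name__ == "__main__":
    # Dimensions taken from the log above.
    print(pix_bytes(29529, 19225))      # 2270780100
    print(exceeds_limit(29529, 19225))  # True
    print(exceeds_limit(23000, 23000))  # False
```

With the dimensions from the log, the estimate is 2,270,780,100 bytes, which is above 2^31 = 2,147,483,648.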

Sep 19 '19 19:09 jbarlow83

Is there a setting/parameter we can specify in the API if we're hitting this?

Jun 21 '21 15:06 Olshansk

No, it's a limit in Leptonica and Tesseract. I believe Leptonica would need to be recompiled because it's a #define, and Tesseract may need its own adjustments.

That said, you could check whether a vector image or some high-DPI resource is causing a very large intermediate image to be created, and see if you can turn that down. See the --help for knobs to twiddle.

Jun 21 '21 19:06 jbarlow83

Haven't had any luck, but figured I'd reply for posterity.

I looked through all the available input parameters here, and setting image_dpi still hit the same issue.

Jun 22 '21 02:06 Olshansk

Try oversample to see if you're doing something that is causing oversampling. The output of -v1 may help too.

Also for posterity, the deal-breaker in Leptonica is here. Not a define, it's a literal. https://github.com/DanBloomberg/leptonica/blob/dcb0096b1e8cfe431f63923cf1a9a674291e36f5/src/pix1.c#L532

I think it's a decent guess that Leptonica has 32-bit assumptions elsewhere in its code too and that check is just the main guard. Realistically we need to downsample or partition the image.
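
One way the downsampling option could look is sketched below. This is a guess at the general idea, not the change that shipped in OCRmyPDF: the function name `backed_off_size` is hypothetical, and the sketch assumes a 32-bit image (4 bytes per pixel) scaled uniformly in both dimensions.

```python
import math
from typing import Tuple

# 2^31 bytes at 4 bytes per pixel gives 2^29 pixels.
MAX_PIXELS_32BPP = 2**31 // 4


def backed_off_size(width: int, height: int) -> Tuple[int, int]:
    """Scale (width, height) down uniformly so width * height * 4 < 2^31.

    Sizes already under the limit are returned unchanged.
    """
    if width * height < MAX_PIXELS_32BPP:
        return width, height
    scale = math.sqrt((MAX_PIXELS_32BPP - 1) / (width * height))
    new_w, new_h = int(width * scale), int(height * scale)
    # Guard against floating-point rounding leaving the result at the limit.
    while new_w * new_h >= MAX_PIXELS_32BPP:
        new_w -= 1
    return new_w, new_h


if __name__ == "__main__":
    # Dimensions from the error log earlier in this thread.
    w, h = backed_off_size(29529, 19225)
    print(w, h, w * h * 4 < 2**31)
```

The same scale factor could be applied to the rasterization DPI instead of the pixel dimensions. Partitioning the image into tiles is the other option mentioned above and is not covered by this sketch.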

Jun 22 '21 05:06 jbarlow83

Added as of v14.1

Jun 02 '23 08:06 jbarlow83
