OCRmyPDF icon indicating copy to clipboard operation
OCRmyPDF copied to clipboard

ocrmypdf --redo-ocr fails with DecompressionBombError on small PDF

Open nicolasguinot opened this issue 4 years ago • 4 comments

Describe the bug My PDF is 900kb When trying to --redo-ocr on a previously OCRized PDF I am getting a PIL.Image.DecompressionBombError, program thinks my PDF is 270 245 400 pixels while pdfimages -list shows its a 8 746 100 pixels PDF.

To Reproduce First ocrmypdf file1.pdf ocr.pdf then ocrmypdf --redo-ocr ocr.pdf redo.pdf

Expected behavior I didn't expect it to crash since my file is less than 256 Mpixels If I try to do : ocrmypdf --force-ocr --max-image-mpixels=270 ocr.pdf force.pdf it works but it takes 150 seconds instead of 9 seconds, and my resulted file is 13 times bigger also the resulting resolution is (13800, 19583) with 400 dpi from initially a resolution of (2484, 3525) with 72 dpi.

Screenshots Initial pdf and ocrized pdf inspection with pdfimages -list file1.pdf 0FBDE5EF-501F-41D6-A118-889B811F2E5A

pdfimages -list ocr.pdf on OCRized PDF 7DDDB86B-20DB-4B9E-8645-46D22B23B608

trying --redo-ocr: 8D8F1E36-7AF8-46BC-B9EE-D49B70E82DE1

After --force-ocr --max-image-mpixels=270 on previously OCRized PDF pdfimages -list force.pdf: F380E027-E8BF-4A93-A27F-8228FD25E226

System (please complete the following information):

  • OS: Debian GNU/Linux 11
  • Python version: 3.7.12
  • OCRmyPDF version: 12.4.0

Installation Used apt-get install ocrmypdf to install

Note I can get it to work by passing --max-image-mpixels=270 --redo-ocr but I can't afford to wait 13 times more than normal time and I don't want to pass this parameter to the program. I would like to make it work without --max-image-mpixels

nicolasguinot avatar Sep 10 '21 13:09 nicolasguinot

Decompression bomb check are designed behavior, to protect users against malicious files. We use a higher limit than Pillow's 75M pixels.

As the issue template indicates, if no example file is provided, usually no further investigation can be attempted. With an example file, it is possible to determine if something went wrong and improve behavior. Without a way of triggering the behavior you observed, I can only guess. Probably there's a vector object in the PDF that causes promotion to 400 dpi to avoid loss of resolution, but again, that is just a guess.

If you installed with apt-get on Debian you probably have ocrmypdf 10.3 or so; Debian has not updated in a long time.

jbarlow83 avatar Sep 16 '21 09:09 jbarlow83

If you installed with apt-get on Debian you probably have ocrmypdf 10.3 or so; Debian has not updated in a long time.

Issue description says Debian 11 with OCRmyPDF 12.4.0

mara004 avatar Oct 07 '21 18:10 mara004

Sorry for not responding sooner, the version is indeed the 12.4.0 I'm not allowed to provide the PDFs to reproduce the issue, I've tried to make a test PDF that I could provide you but the error didn't happen. When I have some more time I will try to provide you with a test PDF where the issue happen

nicolasguinot avatar Oct 08 '21 12:10 nicolasguinot

If you send me the output of qpdf --json yourprivate.pdf (using the qpdf program) that would give some indicating of the type of content in your file without actually reveal its contents. Might be enough in this case.

jbarlow83 avatar Oct 20 '21 06:10 jbarlow83