Error in post-processing on "bag of images" PDF

Open Jmuccigr opened this issue 4 years ago • 6 comments

Describe the bug Post-processing reports an error on a PDF that contains PNG and Group 4 TIFF images. I can't tell which images cause the problem, but there are two error reports, and the PDF contains two PNG files that I created with ImageMagick as 2-bit grayscale. The output PDF seems OK. Here's the error:

While extracting image xref 62, an error occurred
Traceback (most recent call last):
  File "/usr/local/Cellar/ocrmypdf/12.7.0_1/libexec/lib/python3.9/site-packages/ocrmypdf/optimize.py", line 269, in extract_images
    result = extract_fn(
  File "/usr/local/Cellar/ocrmypdf/12.7.0_1/libexec/lib/python3.9/site-packages/ocrmypdf/optimize.py", line 204, in extract_image_generic
    pim.as_pil_image().save(png_name(root, xref))
  File "/usr/local/Cellar/ocrmypdf/12.7.0_1/libexec/lib/python3.9/site-packages/pikepdf/models/image.py", line 719, in as_pil_image
    im = self._extract_transcoded()
  File "/usr/local/Cellar/ocrmypdf/12.7.0_1/libexec/lib/python3.9/site-packages/pikepdf/models/image.py", line 527, in _extract_transcoded
    if self.mode in {'DeviceN', 'Separation'}:
  File "/usr/local/Cellar/ocrmypdf/12.7.0_1/libexec/lib/python3.9/site-packages/pikepdf/models/image.py", line 271, in mode
    raise NotImplementedError(
NotImplementedError: Not sure how to handle PDF image of this type

To Reproduce I'm just doing a simple:

ocrmypdf -l ita input.pdf output.pdf

Example file I can provide an example off-line.

Expected behavior No errors occur.

System

  • OS: macOS
  • OCRmyPDF Version: 12.7.0
  • Homebrew install

Jmuccigr Oct 15 '21 16:10

The PDF contains some images formatted as if they were going to be used in a professional print production environment (i.e. specifying non-CMYK inks). Since these are rare in scanned PDFs I don't plan to do anything special to optimize them.

The error here isn't ideal, but it's also not a problem: the optimizer should ignore the file. I will have to make it trap the exception and print a short info message instead, saying that an image was skipped.
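
A minimal sketch of the trap-and-skip idea described above, not OCRmyPDF's actual code. The names `extract_images_safely` and `fake_extract` are hypothetical, and `fake_extract` only simulates an extractor that raises `NotImplementedError` for one image, as in the traceback:

```python
import logging

log = logging.getLogger("optimize_sketch")


def extract_images_safely(xrefs, extract_fn):
    """Call extract_fn on each xref and skip images it cannot handle.

    Hypothetical helper: it catches only NotImplementedError, the
    exception shown in the traceback, and logs an info-level message
    instead of printing a full stack trace.
    """
    extracted, skipped = [], []
    for xref in xrefs:
        try:
            extracted.append(extract_fn(xref))
        except NotImplementedError as exc:
            log.info("Skipping image xref %s: %s", xref, exc)
            skipped.append(xref)
    return extracted, skipped


def fake_extract(xref):
    # Stand-in for a real extractor; xref 62 plays the unsupported image.
    if xref == 62:
        raise NotImplementedError("Not sure how to handle PDF image of this type")
    return f"{xref:08d}.png"


extracted, skipped = extract_images_safely([61, 62, 63], fake_extract)
```

Catching only `NotImplementedError` keeps genuinely unexpected failures visible, while an unsupported image produces a single log line and the remaining images are still processed.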

jbarlow83 Oct 16 '21 07:10

Yeah, different output would be helpful. It does look like something bad has happened.

Jmuccigr Oct 18 '21 13:10

Would you mind checking if this is fixed on the most recent ocrmypdf and pikepdf? I seem to have misplaced the files you sent.

jbarlow83 Dec 05 '21 08:12

The same happens on Debian Bookworm, and on all my scanned pictures. Would it be possible to catch the exception somehow? Thanks!

gladk Dec 27 '21 10:12

@gladk Please open a new issue and provide as much detail as you can. There are other reasons that error message might be triggered... and it would be really weird for a scanner to produce CMYK images.

jbarlow83 Dec 27 '21 10:12

I get exactly the same error output as described in the initial message. I am attaching a PDF straight from my scanner: it is an empty page, but it still produces this error. I do not see any problem with the recognized document; everything works as expected. Only this exception is thrown. ScannedDocument.pdf

gladk Dec 27 '21 10:12