Error in post-processing on "bag of images" PDF
Describe the bug OCRmyPDF reports an error while post-processing a PDF that contains PNG and Group 4 TIFF images. I can't tell which images are causing the problem, but there are two error reports, and two of the PNG files were created with ImageMagick as 2-bit grayscale. The output PDF seems OK. Here's the error:
While extracting image xref 62, an error occurred
Traceback (most recent call last):
File "/usr/local/Cellar/ocrmypdf/12.7.0_1/libexec/lib/python3.9/site-packages/ocrmypdf/optimize.py", line 269, in extract_images
result = extract_fn(
File "/usr/local/Cellar/ocrmypdf/12.7.0_1/libexec/lib/python3.9/site-packages/ocrmypdf/optimize.py", line 204, in extract_image_generic
pim.as_pil_image().save(png_name(root, xref))
File "/usr/local/Cellar/ocrmypdf/12.7.0_1/libexec/lib/python3.9/site-packages/pikepdf/models/image.py", line 719, in as_pil_image
im = self._extract_transcoded()
File "/usr/local/Cellar/ocrmypdf/12.7.0_1/libexec/lib/python3.9/site-packages/pikepdf/models/image.py", line 527, in _extract_transcoded
if self.mode in {'DeviceN', 'Separation'}:
File "/usr/local/Cellar/ocrmypdf/12.7.0_1/libexec/lib/python3.9/site-packages/pikepdf/models/image.py", line 271, in mode
raise NotImplementedError(
NotImplementedError: Not sure how to handle PDF image of this type
To Reproduce I'm just doing a simple:
ocrmypdf -l ita input.pdf output.pdf
Example file I can provide an example off-line.
Expected behavior No errors occur.
System
- OS: macOS
- OCRmyPDF Version: 12.7.0
- Homebrew install
The PDF contains some images formatted as if they were going to be used in a professional print production environment (i.e. specifying non-CMYK inks). Since these are rare in scanned PDFs I don't plan to do anything special to optimize them.
The error here isn't ideal but also not a problem - the optimizer should ignore the file. I will have to make it trap the exception and print a short info message instead, saying that an image was skipped.
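As a rough illustration of the "trap the exception and report a skip" idea, here is a minimal sketch. It is not OCRmyPDF's actual code: `extract_images_safely` and `fake_extract` are hypothetical names, and `fake_extract` only simulates the `NotImplementedError` seen in the traceback for xref 62.

```python
import logging

log = logging.getLogger("optimize_sketch")  # hypothetical logger name


def extract_images_safely(xrefs, extract_fn):
    """Call extract_fn(xref) for each xref and skip images it cannot handle.

    Returns (extracted, skipped): a list of (xref, result) pairs and a list
    of the xrefs that raised NotImplementedError.
    """
    extracted, skipped = [], []
    for xref in xrefs:
        try:
            result = extract_fn(xref)
        except NotImplementedError as e:
            # Report a one-line info message instead of a full traceback.
            log.info("Skipping image xref %s: %s", xref, e)
            skipped.append(xref)
            continue
        extracted.append((xref, result))
    return extracted, skipped


def fake_extract(xref):
    """Stand-in extractor (assumption): fails on xref 62 like the report above."""
    if xref == 62:
        raise NotImplementedError("Not sure how to handle PDF image of this type")
    return f"{xref:08d}.png"


extracted, skipped = extract_images_safely([61, 62, 63], fake_extract)
```

With this shape, an unsupported image is left untouched in the output PDF and the remaining images are still processed.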
Yeah, different output would be helpful. It does look like something bad has happened.
Would you mind checking if this is fixed on the most recent ocrmypdf and pikepdf? I seem to have misplaced the files you sent.
The same happens on Debian Bookworm, with all of my scanned pictures. Would it be possible to catch the exception somehow? Thanks!
@gladk Please open a new issue and provide as much detail as you can. There are other reasons that error message might be triggered... and it would be really weird for a scanner to produce CMYK images.
I get exactly the same error output as described in the initial message. I am attaching a PDF straight from my scanner: the page is blank, but it still produces this error. I do not see any problem with the recognized document; everything works as expected. Only this exception is thrown. ScannedDocument.pdf