Input PDF created from TIF file edited/authored with GIMP + img2pdf results in a distorted PDF/A
Describe the bug
Bitonal TIF files edited with GIMP (version 2.10.18) and converted to an input PDF with img2pdf produce strangely malformed pages in the output PDF when run through OCRmyPDF. The TIF files were originally created by ScanTailor-Universal, but only bitonal files edited with GIMP and overwritten using the default export settings seem to exhibit this issue.
OCRmyPDF outputs an error message in red when processing the page, even without verbose output:
1 **** Error: ICCbased space /N value does not match the ICC profile.
Using the number of channels from the profile.
Output may be incorrect.
The problem does not occur with --output-type pdf.
I do have a "fix"/workaround that I've included down below.
To Reproduce
ocrmypdf image.pdf output.pdf
The input PDF was created with: img2pdf --imgsize 300dpi --output ../image.pdf *
Example files
- input file: image.pdf
- my output: bugtest.pdf
- original TIF file (zipped so GitHub will accept it)
- non-edited TIF file, for PDF creation purposes
Expected behavior
Output file should look pretty much the same as the input when viewed.
Workaround
Using GraphicsMagick (ImageMagick probably works too) to strip all profile and text attributes from the input images prevents this issue: gm mogrify -strip input.tif.
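For anyone who wants to script this, here is a rough sketch that chains the strip step with the img2pdf and ocrmypdf commands from this report. The helper names `build_commands` and `run_workaround` are made up for illustration, and it assumes `gm`, `img2pdf` and `ocrmypdf` are all on the PATH. Only the flags already mentioned in this issue are used.

```python
import subprocess


def build_commands(tifs, pdf, output):
    """Return the three command lines of the workaround, in order.

    tifs   -- paths of the TIF files to clean and combine
    pdf    -- intermediate PDF written by img2pdf
    output -- final PDF written by OCRmyPDF
    """
    tif_args = [str(p) for p in tifs]
    return [
        # Strip profiles and text attributes (rewrites the TIFs in place).
        ["gm", "mogrify", "-strip", *tif_args],
        # Same img2pdf invocation as in the report, with explicit file names.
        ["img2pdf", "--imgsize", "300dpi", "--output", str(pdf), *tif_args],
        # Plain OCRmyPDF run, as in the reproduction step.
        ["ocrmypdf", str(pdf), str(output)],
    ]


def run_workaround(tifs, pdf, output):
    """Run each step and stop at the first one that fails."""
    for cmd in build_commands(tifs, pdf, output):
        subprocess.run(cmd, check=True)
```

Calling `run_workaround(["page1.tif"], "image.pdf", "output.pdf")` would run the three steps in sequence. Since `gm mogrify` overwrites its input files, it is safer to run this on copies of the TIFs.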
Reporting this here for starters, even if OCRmyPDF isn't the problem (it seems like a bad input file to me, since the error warns about possibly incorrect output). I'm not too familiar with the program, so I'm not sure which upstream(s) this should be reported to, whether this is a known issue, or whether there's some configuration I'm missing in some of the software involved.
System
- OS: Linux Mint 20.1 MATE
- OCRmyPDF Version: 11.6.2
- OCRmyPDF was installed via pip. The jbig2 encoder is installed (built from source as per documentation).
- img2pdf version: 0.4.0
- GIMP version: 2.10.18
I also had this issue (see #629 and #636). Updating pngquant helped. Exporting the files from GIMP in grayscale mode (not binary) also helps.
Thank you for an excellent issue report with all of the details provided.
When I try to open image.pdf, Acrobat "helpfully" complains about "insufficient data for image". I get the same result by using img2pdf 0.4.0 to convert the TIF you provided to PDF.
I think we need img2pdf to produce a valid PDF or reject the input file before we can proceed further, so I suggest raising the issue there.
ImageMagick, for its part, also refuses to convert the file to PNG...
$ convert issue739-input.jp2.tif issue739-input-topng.png
convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `issue739-input-topng.png' @ warning/png.c/MagickPNGWarningHandler/1748.
...so it might be GIMP that is at fault. img2pdf's maintainer josch is terrific, and I'm sure he'll have a good take on this issue.
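Both the Ghostscript error and the ImageMagick warning suggest that an RGB ICC profile is attached to a single-channel image. A quick way to check a file for this kind of mismatch is to read the colour space signature from the ICC profile header and compare it with the image's channel count. The sketch below is illustrative only: the function names are hypothetical, the signature table covers just a few common colour spaces, and it relies on the ICC header layout (128-byte header, data colour space at bytes 16 to 19, `acsp` signature at bytes 36 to 39).

```python
import struct

# Channel counts for a few common ICC data colour space signatures.
ICC_CHANNELS = {b"GRAY": 1, b"RGB ": 3, b"Lab ": 3, b"XYZ ": 3, b"CMYK": 4}


def icc_channel_count(profile: bytes) -> int:
    """Return the number of channels the ICC profile header declares."""
    if len(profile) < 128:
        raise ValueError("ICC profile is shorter than the 128-byte header")
    if profile[36:40] != b"acsp":
        raise ValueError("missing 'acsp' profile file signature")
    colour_space = profile[16:20]
    if colour_space not in ICC_CHANNELS:
        raise ValueError(f"unhandled colour space {colour_space!r}")
    return ICC_CHANNELS[colour_space]


def profile_matches_image(profile: bytes, image_channels: int) -> bool:
    """True if the profile's channel count equals the image's."""
    return icc_channel_count(profile) == image_channels


def fake_header(colour_space: bytes) -> bytes:
    """Build a minimal header-only profile, for demonstration purposes."""
    header = bytearray(128)
    struct.pack_into(">I", header, 0, 128)  # profile size field
    header[16:20] = colour_space
    header[36:40] = b"acsp"
    return bytes(header)


# An RGB profile on a 1-channel (bitonal or grayscale) image is a mismatch.
print(profile_matches_image(fake_header(b"RGB "), 1))
```

To try this on a real file, the embedded profile bytes would have to be extracted first. With Pillow, for example, `Image.open(path).info.get("icc_profile")` should return them when a profile is present, though I have not run that against the TIF attached here.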