OCRmyPDF

Input PDF created from TIF file edited/authored with GIMP + img2pdf results in a distorted PDF/A

Open veikk0 opened this issue 4 years ago • 2 comments

Describe the bug

Bitonal TIF files edited with GIMP (version 2.10.18) and converted to a PDF input file with img2pdf produce strangely malformed pages in the output PDF when run through OCRmyPDF. The TIF files were originally created by ScanTailor-Universal, but only bitonal files that were edited with GIMP and overwritten using the default export settings seem to exhibit this issue.

OCRmyPDF outputs an error message in red when processing the page, even without verbose output:

    **** Error: ICCbased space /N value does not match the ICC profile.
                 Using the number of channels from the profile.
                 Output may be incorrect.

Full --verbose 1 log

The problem does not occur with --output-type pdf.

I do have a "fix"/workaround that I've included down below.

To Reproduce

    ocrmypdf image.pdf output.pdf

Input PDF was created with:

    img2pdf --imgsize 300dpi --output ../image.pdf *

Example files

Expected behavior

Output file should look pretty much the same as the input when viewed.

Workaround

Using GraphicsMagick (ImageMagick probably works too) to strip all profile and text attributes from the input images prevents this issue:

    gm mogrify -strip input.tif
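For a whole directory of scans, the workaround above can be chained with the two commands from the reproduction steps. The following is only a sketch, not something taken from the report: the helper names `build_commands` and `run_pipeline` are hypothetical, and the only command-line options used are the ones quoted in this issue. Note that `gm mogrify` overwrites the files in place, so it is best run on copies.

```python
import subprocess


def build_commands(tif_paths, input_pdf, output_pdf):
    """Return the argv lists for the strip -> img2pdf -> OCRmyPDF steps.

    Hypothetical helper; the flags are the ones quoted in the issue.
    """
    tifs = [str(p) for p in tif_paths]
    return [
        # Strip all profiles and text attributes in place (the reported workaround).
        ["gm", "mogrify", "-strip", *tifs],
        # Rebuild the input PDF with the same options as in the report.
        ["img2pdf", "--imgsize", "300dpi", "--output", str(input_pdf), *tifs],
        # Run OCRmyPDF with the same arguments as in the reproduction step.
        ["ocrmypdf", str(input_pdf), str(output_pdf)],
    ]


def run_pipeline(tif_paths, input_pdf, output_pdf, dry_run=False):
    """Run each step in order; with dry_run=True only print the commands."""
    commands = build_commands(tif_paths, input_pdf, output_pdf)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops the pipeline at the first failing tool.
            subprocess.run(argv, check=True)
    return commands
```

Calling `run_pipeline(["page1.tif", "page2.tif"], "../image.pdf", "output.pdf", dry_run=True)` prints the three commands without touching any files, which makes it easy to review them before running them for real.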

Reporting this here for starters, even if OCRmyPDF isn't the problem (seems like a bad input file to me, since the error warns about possibly incorrect output). I'm not too familiar with the program, so I'm not sure which upstream(s) this should be reported to, or if this is a known issue, or if there's some kind of configuration thing I'm missing in some of the software involved.

System

  • OS: Linux Mint 20.1 MATE
  • OCRmyPDF Version: 11.6.2
  • OCRmyPDF was installed via pip. The jbig2 encoder is installed (built from source as per documentation).
  • img2pdf version: 0.4.0
  • GIMP version: 2.10.18

veikk0 avatar Feb 24 '21 13:02 veikk0

I also had this issue (see #629 and #636). Updating pngquant helped. Exporting the files from GIMP in grayscale mode (not binary) also helps.

femifrak avatar Feb 25 '21 09:02 femifrak

Thank you for an excellent issue report with all of the details provided.

When I try to open image.pdf, Acrobat "helpfully" complains about "insufficient data for image". I get the same result by using img2pdf 0.4.0 to convert the TIF you provided to PDF.

I think we need img2pdf to produce a valid PDF or reject the input file before we can proceed further, so I suggest raising the issue there.

ImageMagick, for its part, also refuses to convert the file to PNG...

    $ convert issue739-input.jp2.tif issue739-input-topng.png
    convert: profile 'icc': 'RGB ': RGB color space not permitted on grayscale PNG `issue739-input-topng.png' @ warning/png.c/MagickPNGWarningHandler/1748.

...so it might be GIMP that is at fault. img2pdf's maintainer josch is terrific, and I'm sure he'll have a good take on this issue.

jbarlow83 avatar Feb 26 '21 08:02 jbarlow83
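Both the Ghostscript error and the ImageMagick warning above point at an ICC profile whose color space does not match the number of channels in the image. A rough way to check a file for this before building the PDF is sketched below. It assumes that the embedded profile is the cause, which this thread suggests but does not confirm. The function names are hypothetical, only three common color space signatures are covered, and `check_tiff` needs Pillow installed. The one fixed fact used is that bytes 16 to 19 of the 128-byte ICC header hold the data color space signature.

```python
# Channel counts for a few common ICC data color space signatures.
# This table is deliberately incomplete (an assumption of this sketch).
ICC_CHANNELS = {"GRAY": 1, "RGB ": 3, "CMYK": 4}


def icc_color_space(profile: bytes) -> str:
    """Return the 4-character data color space signature from an ICC header."""
    if len(profile) < 128:
        raise ValueError("ICC profile is shorter than the 128-byte header")
    return profile[16:20].decode("ascii")


def profile_matches_channels(profile: bytes, image_channels: int) -> bool:
    """True if the profile's color space implies the given channel count.

    Signatures missing from ICC_CHANNELS are reported as a mismatch.
    """
    return ICC_CHANNELS.get(icc_color_space(profile)) == image_channels


def check_tiff(path) -> bool:
    """Hypothetical helper: compare a TIFF's embedded profile with its bands."""
    from PIL import Image  # imported lazily; requires Pillow

    with Image.open(path) as im:
        profile = im.info.get("icc_profile")
        channels = len(im.getbands())
    # No embedded profile means there is nothing that can mismatch.
    return profile is None or profile_matches_channels(profile, channels)
```

If `check_tiff` returns `False` for a bitonal or grayscale TIF, stripping the profile as described in the workaround is a reasonable next step before converting with img2pdf.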