OCRmyPDF Inverted black and white from optimization

I'm working with a PDF that contains only TIFF images, created with ImageMagick and then assembled into a PDF with img2pdf. Forcing no optimization leaves the images OK. This looks like the same result as #419.

Sep 18 '22 14:09 Jmuccigr

Check that you have the latest pikepdf. Version 5.6.1 introduced a possible fix for some black/white inversion issues.

Sep 18 '22 18:09 jbarlow83

I've got 6.0.2.

Sep 19 '22 10:09 Jmuccigr

Any thoughts?

Sep 25 '22 11:09 Jmuccigr

Thoughts:

It's hard to get monochrome right because there are various options to invert that are not always respected by all programs.
Because of the above, it's hard to investigate without a PDF.
You could use qpdf's new --json features as a way of showing me the structure of the PDF without the content.
Using a heuristic is really tempting.
I don't know when I'll have bandwidth.
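
A sketch of how such a structure dump could be inspected, assuming it was produced with something like `qpdf --json input.pdf > structure.json` (recent qpdf versions are expected to leave stream data out of JSON written to stdout, but check the output before sharing it). The helper names `find_image_dicts`, `polarity_summary` and `main` are hypothetical. The walker does not rely on the exact JSON layout; it only looks for dictionaries whose /Subtype is /Image and keeps the keys that can affect black/white polarity.

```python
import json

# Keys that influence how 1-bit image samples end up as black or white.
POLARITY_KEYS = ("/Filter", "/DecodeParms", "/Decode", "/ImageMask",
                 "/ColorSpace", "/BitsPerComponent")


def find_image_dicts(node):
    """Yield every dictionary in a parsed JSON tree whose /Subtype is /Image."""
    if isinstance(node, dict):
        if node.get("/Subtype") == "/Image":
            yield node
        for value in node.values():
            yield from find_image_dicts(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_image_dicts(item)


def polarity_summary(image_dict):
    """Keep only the polarity-related keys that are actually present."""
    return {key: image_dict[key] for key in POLARITY_KEYS if key in image_dict}


def main(path):
    """Print one summary per image found in a qpdf JSON dump (assumed input)."""
    with open(path, encoding="utf-8") as handle:
        document = json.load(handle)
    for image in find_image_dicts(document):
        print(json.dumps(polarity_summary(image), indent=2))
```

Calling `main("structure.json")` on dumps of both the input and the output PDF and comparing the /Decode and /DecodeParms entries should show whether an inversion flag was added, dropped, or changed along the way.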

Sep 28 '22 05:09 jbarlow83

Any updates on this issue? I have similar problems and the version of pikepdf is 6.2.1

Oct 24 '22 10:10 alirf81

@alirf81 If you'd like to move things along faster, please submit a reproducible example PDF and command line.

Oct 24 '22 10:10 jbarlow83

Hi there. Thank you so much for working on and maintaining this project.

I have been experiencing a similar issue: when I try to optimize a particular pdf (without performing OCR) and have it converted into a regular pdf (rather than pdf/a), the resulting pdf also has black and white inverted. I have tried it on two pdfs (of scanned books) so far, and it keeps happening to one of them, which has a little bit of a black margin on every other page (don't know if that's relevant). I use the following command:

ocrmypdf --output-type pdf --tesseract-timeout=0 --optimize 3 --skip-text input.pdf output.pdf

If I do it without --output-type pdf everything seems fine.

I am running macOS 12.6.1, and OCRmyPDF 14.0.1; and just homebrew updated/upgraded everything. As you probably can tell I'm not a superuser, so I don't know how to get the structure of pdfs, etc.

If I'm not using the best command to optimize an already ocred pdf and have it saved as a regular pdf, I'd appreciate your help on that as well.

Is there a way to quickly verify whether a pdf is regular or pdf/a on macos, without using, say, Adobe Acrobat?
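
One possible quick check, as a sketch only: a PDF/A file declares itself through a `pdfaid:part` entry in its XMP metadata, so scanning the raw bytes for that entry shows whether the file claims to be PDF/A. This assumes the XMP packet is stored uncompressed, which is the usual case for PDF/A files, and it only detects the claim; it does not validate conformance. The function names are hypothetical.

```python
import re

# Matches both XMP spellings: <pdfaid:part>2</pdfaid:part> and pdfaid:part="2".
PDFA_PART = re.compile(rb"pdfaid:part\s*(?:>|=\s*[\"'])\s*(\d)")


def claimed_pdfa_part(data):
    """Return the PDF/A part number the bytes claim (1, 2, 3, ...), or None."""
    match = PDFA_PART.search(data)
    return int(match.group(1)) if match else None


def claimed_pdfa_part_of_file(path):
    """Read a file and report its claimed PDF/A part, or None for a plain PDF."""
    with open(path, "rb") as handle:
        return claimed_pdfa_part(handle.read())
```

For example, `claimed_pdfa_part_of_file("output.pdf")` returning `None` suggests a regular PDF, while a number suggests the file identifies itself as that PDF/A part.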

Many thanks!

Nov 13 '22 15:11 poldy8

Hmm, if I use pdfimages to extract the image from my PDF, it produces a ccitt/params pair which, when run through fax2tiff, produces the same kind of inverted image. If I tell pdfimages to output a png, the image has the expected colors.

Nov 13 '22 16:11 Jmuccigr

[I had to delete and repost this comment because I made a mistake and uploaded the wrong files. Sorry…]

Here is an example, with everything that led to its creation. It’s a blank page, but all the pages with text from the same original file created using the same process got inverted in the end.

The original PDF file was A.pdf, but when I OCRed it (i.e. the other pages with text in them), the result had spaces between almost every letter, so I decided to extract the images, rebuild a PDF file, and re-OCR the result:
pdfimages -tiff A.pdf B
img2pdf --output C.pdf B-000.tif
ocrmypdf --language eng --output-type pdf C.pdf D.pdf
The resulting file D.pdf is now correctly OCRed, without spaces between the letters, but white-on-black rather than black-on-white.
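
When narrowing down where the polarity flips in a chain like this, it can help to check what the intermediate TIFF itself declares. The sketch below reads the PhotometricInterpretation tag (262) from the first image directory of a classic TIFF using only the standard library; in the TIFF specification, 0 means WhiteIsZero and 1 means BlackIsZero. The function name is hypothetical, and whether this tag is involved in the inversion reported here is an assumption, not something the thread establishes.

```python
import struct

PHOTOMETRIC_TAG = 262  # TIFF PhotometricInterpretation
PHOTOMETRIC_NAMES = {0: "WhiteIsZero", 1: "BlackIsZero"}


def tiff_photometric(data):
    """Return the PhotometricInterpretation value of the first IFD, or None."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    for index in range(count):
        # Each IFD entry is 12 bytes: tag, type, count, then a 4-byte value field.
        start = ifd_offset + 2 + 12 * index
        tag, _kind, _n = struct.unpack(order + "HHI", data[start:start + 8])
        if tag == PHOTOMETRIC_TAG:
            # A single SHORT is stored in the first two bytes of the value field.
            (value,) = struct.unpack(order + "H", data[start + 8:start + 10])
            return value
    return None
```

A possible use is `PHOTOMETRIC_NAMES.get(tiff_photometric(open("B-000.tif", "rb").read()))`, run on the file produced by the pdfimages step and compared against how the image looks in a viewer.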

Here are all the files, except B-000.tif since GitHub doesn’t allow me to upload it. A.pdf C.pdf D.pdf

Versions:

OCRmyPDF 14.0.1
python-pikepdf 6.2.6
ghostscript 9.56.1
img2pdf 0.4.4
poppler 22.12.0 (for pdfimages)

Jan 19 '23 14:01 vejkse