
Inverted black and white from optimization

Open Jmuccigr opened this issue 3 years ago • 9 comments

Working with a PDF that has only tiff images in it, created with ImageMagick and then assembled into a PDF with img2pdf. Forcing no optimization leaves the images ok. Seems like same result as #419.

Jmuccigr avatar Sep 18 '22 14:09 Jmuccigr

Check that you have the latest pikepdf. 5.6.1 introduced a possible fix to some black/white inversion issues.

jbarlow83 avatar Sep 18 '22 18:09 jbarlow83

I've got 6.0.2.

Jmuccigr avatar Sep 19 '22 10:09 Jmuccigr

Any thoughts?

Jmuccigr avatar Sep 25 '22 11:09 Jmuccigr

Thoughts

  • it's hard to get monochrome right because there are various options to invert that are not always respected by all programs
  • because of the above, it's hard to investigate without a PDF
  • you could use qpdf's new --json features as a way of showing me the structure of the PDF without the content
  • using a heuristic is really tempting
  • I don't know when I'll have bandwidth
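To make the first point concrete, here is a minimal sketch of how two such inversion switches interact for a CCITT-compressed 1-bit DeviceGray image, based on a reading of the PDF specification: the filter's /BlackIs1 parameter and the image's /Decode array. The function name `rendered_gray` and its parameters are made up for illustration; this is not code from OCRmyPDF or pikepdf, and it ignores other cases such as /ImageMask.

```python
def rendered_gray(fax_pixel_is_black, black_is_1=False, decode=(0, 1)):
    """Return the gray level (0 = black, 1 = white) a viewer should paint.

    Assumed model, following the PDF specification as understood here:
    - /BlackIs1 false (default): the CCITT filter emits 0 for black pixels.
    - /BlackIs1 true: the CCITT filter emits 1 for black pixels.
    - /Decode [dmin dmax] maps a 1-bit sample s to dmin + s * (dmax - dmin).
    """
    # Sample bit produced by the CCITT filter for this pixel.
    sample = 1 if fax_pixel_is_black == black_is_1 else 0
    dmin, dmax = decode
    return dmin + sample * (dmax - dmin)


# Defaults: a black fax pixel is painted black.
print(rendered_gray(True))                                   # 0
# /BlackIs1 true with the default /Decode: the same pixel is painted white.
print(rendered_gray(True, black_is_1=True))                  # 1
# /BlackIs1 true combined with /Decode [1 0]: painted black again.
print(rendered_gray(True, black_is_1=True, decode=(1, 0)))   # 0
```

A tool that honors only one of the two settings, or that drops one while recompressing the image, would flip the result, which is consistent with the symptom reported here.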

jbarlow83 avatar Sep 28 '22 05:09 jbarlow83

Any updates on this issue? I have similar problems and the version of pikepdf is 6.2.1

alirf81 avatar Oct 24 '22 10:10 alirf81

@alirf81 If you'd like to move things along faster, please submit a reproducible example PDF and command line.

jbarlow83 avatar Oct 24 '22 10:10 jbarlow83

Hi there. Thank you so much for working on and maintaining this project.

I have been experiencing a similar issue: when I try to optimize a particular pdf (without performing OCR) and have it converted into a regular pdf (rather than pdf/a), the resulting pdf also has black and white inverted. I have tried it on two pdfs (of scanned books) so far, and it keeps happening to one of them, which has a bit of a black margin on every other page (don't know if that's relevant). I use the following command:

ocrmypdf --output-type pdf --tesseract-timeout=0 --optimize 3 --skip-text input.pdf output.pdf

If I do it without --output-type pdf everything seems fine.

I am running macOS 12.6.1, and OCRmyPDF 14.0.1; and just homebrew updated/upgraded everything. As you probably can tell I'm not a superuser, so I don't know how to get the structure of pdfs, etc.

If I'm not using the best command to optimize an already ocred pdf and have it saved as a regular pdf, I'd appreciate your help on that as well.

Is there a way to quickly verify whether a pdf is regular or pdf/a on macos, without using, say, Adobe Acrobat?
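On that last question, one rough check that needs only Python: a PDF/A file is expected to declare its conformance in XMP metadata through a `pdfaid:part` entry, so scanning the raw bytes for that entry shows whether the file claims to be PDF/A. The sketch below is a heuristic under that assumption. The function name `claimed_pdfa_part` is hypothetical, the check will miss metadata stored in a compressed stream, and it says nothing about whether the file actually conforms; a real validator such as veraPDF is needed for that.

```python
import re
from typing import Optional

# Matches both XMP spellings: <pdfaid:part>1</pdfaid:part> and pdfaid:part="1".
_PDFA_PART = re.compile(rb'pdfaid:part\s*(?:>|=\s*["\'])\s*(\d)')


def claimed_pdfa_part(data: bytes) -> Optional[str]:
    """Return the PDF/A part the file claims (e.g. "1", "2", "3"), or None."""
    match = _PDFA_PART.search(data)
    return match.group(1).decode("ascii") if match else None


# Usage sketch ("output.pdf" is a placeholder file name):
#   with open("output.pdf", "rb") as fh:
#       print(claimed_pdfa_part(fh.read()))
```

A result of `None` suggests a regular PDF; a digit suggests the file at least claims PDF/A conformance.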

Many thanks!

poldy8 avatar Nov 13 '22 15:11 poldy8

Hmm, if I use pdfimages to extract the image from my PDF, it produces a ccitt/params pair which, when I use fax2tiff on, produces the same kind of inverted image. If I tell pdfimages to output a png, the image has the expected colors.

Jmuccigr avatar Nov 13 '22 16:11 Jmuccigr

[I had to delete and repost this comment because I made a mistake and uploaded the wrong files. Sorry…]

Here is an example, with everything that led to its creation. It’s a blank page, but all the pages with text from the same original file, created using the same process, ended up inverted.

  1. The original PDF file was A.pdf, but when OCRing it (i.e. the other pages with text in them), the result had spaces between almost every letter, so I decided to extract the images, rebuild a PDF file, and re-OCR the result.
  2. pdfimages -tiff A.pdf B
  3. img2pdf --output C.pdf B-000.tif
  4. ocrmypdf --language eng --output-type pdf C.pdf D.pdf — The resulting file D.pdf is now correctly OCRed, without spaces between the letters, but white-on-black rather than black-on-white.
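For anyone who wants to rerun these steps, here is a small sketch that assembles the three commands above as argument lists. The helper names `repro_commands` and `run_all` are made up for illustration. The optional `optimize` argument appends ocrmypdf's `--optimize` flag, so that a run with `--optimize 0` can be compared against the default, since the original report says that forcing no optimization leaves the images intact.

```python
import shlex
import subprocess


def repro_commands(src="A.pdf", prefix="B", optimize=None):
    """Build the extract / rebuild / OCR commands from the steps above."""
    ocr = ["ocrmypdf", "--language", "eng", "--output-type", "pdf"]
    if optimize is not None:
        # Not part of the original steps: added only to compare optimization levels.
        ocr += ["--optimize", str(optimize)]
    return [
        ["pdfimages", "-tiff", src, prefix],
        ["img2pdf", "--output", "C.pdf", f"{prefix}-000.tif"],
        ocr + ["C.pdf", "D.pdf"],
    ]


def run_all(commands):
    """Echo and run each command, stopping at the first failure."""
    for argv in commands:
        print("$", shlex.join(argv))
        subprocess.run(argv, check=True)


# Usage sketch (requires pdfimages, img2pdf and ocrmypdf on PATH):
#   run_all(repro_commands())            # the steps as reported
#   run_all(repro_commands(optimize=0))  # same, with optimization disabled
```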

Here are all the files, except B-000.tif since GitHub doesn’t allow me to upload it. A.pdf C.pdf D.pdf

Versions:

  • OCRmyPDF 14.0.1
  • python-pikepdf 6.2.6
  • ghostscript 9.56.1
  • img2pdf 0.4.4
  • poppler 22.12.0 (for pdfimages)

vejkse avatar Jan 19 '23 14:01 vejkse