ColorSpace-Indexed-ICCBased-DeviceGray converted to RGB

Open drboone opened this issue 6 years ago • 2 comments

Describe the issue I'm reporting this much larger output file as requested by the program. If I extract all of the scanned page images from the attached pdf using pdfimages, they come out as .pbm files. However, if I do the same to the pdf produced by ocrmypdf, they come out as .ppm files. Hopefully the attached pdf helps you track down whatever bizarre case I've managed to create.

   INFO -    4: [tesseract] Image too small to scale!! (2x36 vs min width of 3)
   INFO -    4: [tesseract] Line cannot be recognized!!
   INFO -    4: [tesseract] Image too small to scale!! (2x36 vs min width of 3)
   INFO -    4: [tesseract] Line cannot be recognized!!
WARNING -    3: [tesseract] lots of diacritics - possibly poor OCR
   INFO - Optimize ratio: 1.00 savings: 0.0%
   INFO - Output file is a PDF/A-2B (as expected)
WARNING - The output file size is 2.25× larger than the input file.
No reason for this increase is known.  Please report this issue.

To Reproduce ocrmypdf "1st Solutions July 1985.pdf" out/"1st Solutions July 1985.pdf"

Example file Culprit pdf is attached

Please check any or all that apply about the test file:

  • [x] This is the input file
  • [x] The file contains no personal or confidential information
  • [ ] I am the copyright holder for this file
  • [ ] I permit this file to be included in the OCRmyPDF test suite under the CC-BY-SA 4.0 license
  • [ ] I am not the copyright holder, but this file is available under a free software license

System:

  • OS: Linux (debian sid, kernel 4.16.0-2)
  • OCRmyPDF Version: 8.1.0

Additional context Attached file: 1st Solutions July 1985.pdf

drboone, Feb 24 '19 23:02

Thank you.

The issue is that the images are marked as having a complex colorspace that ocrmypdf does not recognize, so it takes the precaution of assuming the colorspace is RGB and upgrades all of the images from monochrome to RGB.
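For anyone who wants to confirm the monochrome-to-RGB upgrade on their own files, here is a minimal sketch that classifies images extracted by pdfimages by their Netpbm magic number. It is not part of OCRmyPDF; the names `NETPBM_KINDS` and `netpbm_kind` are made up for illustration, and it assumes pdfimages wrote its default Netpbm output (.pbm, .pgm or .ppm).

```python
# Illustrative helper, not OCRmyPDF code: report what kind of Netpbm image
# a file extracted by pdfimages actually is, based on its two-byte magic number.
NETPBM_KINDS = {
    b"P1": "bilevel",  # ASCII PBM, 1 bit per pixel
    b"P4": "bilevel",  # raw PBM, 1 bit per pixel
    b"P2": "gray",     # ASCII PGM
    b"P5": "gray",     # raw PGM
    b"P3": "rgb",      # ASCII PPM
    b"P6": "rgb",      # raw PPM
}


def netpbm_kind(path):
    """Return 'bilevel', 'gray', 'rgb', or 'unknown' for the file at path."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    return NETPBM_KINDS.get(magic, "unknown")
```

Running this over the images extracted from the input and from the output should show "bilevel" for the former and "rgb" for the latter if the fallback described above was taken.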

You could work around this with pdfimages by outputting to monochrome and then repacking as a PDF.
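A rough sketch of that workaround, under several assumptions: pdfimages (from Poppler) is on the PATH, the third-party Pillow library is installed, every page is a single 1-bit scan that pdfimages writes as .pbm, and the scans are 300 dpi (the real resolution is not stated in this thread). The function names `page_order` and `rebuild_monochrome_pdf` are hypothetical, and the rebuilt PDF keeps only the images, not any existing text layer or metadata.

```python
import re
import subprocess
from pathlib import Path


def page_order(paths):
    """Sort pdfimages output files by the numeric counter in their names."""
    def counter(path):
        match = re.search(r"-(\d+)\.[^.]+$", Path(path).name)
        return int(match.group(1)) if match else -1
    return sorted(paths, key=counter)


def rebuild_monochrome_pdf(src_pdf, out_pdf, workdir, dpi=300.0):
    """Extract 1-bit page images and repack them into a new PDF.

    dpi is an assumed value; set it to the true scan resolution.
    """
    from PIL import Image  # third-party: Pillow

    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    # Default pdfimages output is Netpbm: .pbm for bilevel images.
    subprocess.run(
        ["pdfimages", str(src_pdf), str(workdir / "page")], check=True
    )

    files = page_order(workdir.glob("page-*.pbm"))
    if not files:
        raise RuntimeError("pdfimages produced no .pbm files")

    images = [Image.open(f) for f in files]
    images[0].save(
        out_pdf,
        "PDF",
        save_all=True,
        append_images=images[1:],
        resolution=dpi,
    )
```

The rebuilt file can then be passed to ocrmypdf as usual. Sorting on the numeric counter rather than the raw filename keeps pages in order even if the counter grows past three digits.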

Not sure when I'll be able to address this.

jbarlow83, Feb 25 '19 11:02

Yes, I rebuilt the PDF trivially. Thanks for looking!

drboone, Feb 28 '19 22:02