Double to quadruple file size and worse quality with --deskew --clean-final (due to mask?)

Open bllngr opened this issue 2 years ago • 2 comments

Describe the bug Running files from my Samsung M2070 through OCRmyPDF creates much bigger PDFs with worse quality than the original when unpaper and --clean-final are used.

To Reproduce Each of these commands produces files that are much bigger and worse than the original:

$ ocrmypdf -l deu -O1 --deskew --clean-final ./input.pdf ./output-O1-deskew-clean-final.pdf
[...]
The output file size is 4.10× larger than the input file.
[...]

$ ocrmypdf -l deu -O3 --deskew --clean-final ./input.pdf ./output-O3-deskew-clean-final.pdf
[...]
The output file size is 2.51× larger than the input file.
[...]
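For anyone trying to reproduce this with several option combinations, here is a minimal Python sketch that computes the same kind of output/input size ratio from the files on disk. The helper name `size_ratio` is made up for illustration, and the file names are simply the ones used in the commands above; this is not how OCRmyPDF itself computes its message.

```python
import os


def size_ratio(input_path, output_path):
    """Return output size divided by input size (both in bytes on disk)."""
    return os.path.getsize(output_path) / os.path.getsize(input_path)


if __name__ == "__main__":
    # File names taken from the commands in this report; adjust as needed.
    source = "input.pdf"
    outputs = [
        "output-O1-deskew-clean-final.pdf",
        "output-O3-deskew-clean-final.pdf",
    ]
    if os.path.exists(source):
        for path in outputs:
            if os.path.exists(path):
                print(f"{path}: {size_ratio(source, path):.2f}x the input size")
```

Running it next to the attached files should make it easy to compare further option combinations without reading the full OCRmyPDF log each time.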

pdfimages output:

$ pdfimages -list input.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1240  1753  gray    1   8  jpeg   no         4  0   151   151 61.2K 2.9%
   1     1 image     413   584  gray    1   8  jpeg   no         5  0    50    50 4526B 1.9%
   1     2 mask     2481  3507  -       1   1  jpeg   no         5  0   301   301 4526B 0.4%

$ pdfimages -list output-O1-deskew-clean-final.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1240  1755  gray    1   8  jpeg   no        15  0   151   151  538K  25%

$ pdfimages -list output-O3-deskew-clean-final.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1240  1755  gray    1   8  jpeg   no        15  0   151   151  323K  15%

Note that the input file contains a mask. This may be a similar case to #261 and #269. As far as I can see, no colorspace promotion or transcoding happened, but I'm only guessing.
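To compare the listings more systematically, the `pdfimages -list` text can be parsed into rows and the stored image sizes summed. The following is a rough Python sketch, not part of OCRmyPDF or poppler: the helper names are hypothetical, and it assumes the `size` column uses 1024-based suffixes (B, K, M, G), so the totals are approximate.

```python
# Assumption: the "size" column uses 1024-based suffixes.
UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3}

# Listing copied from the `pdfimages -list input.pdf` output above.
INPUT_LISTING = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1240  1753  gray    1   8  jpeg   no         4  0   151   151 61.2K 2.9%
   1     1 image     413   584  gray    1   8  jpeg   no         5  0    50    50 4526B 1.9%
   1     2 mask     2481  3507  -       1   1  jpeg   no         5  0   301   301 4526B 0.4%
"""


def parse_size(text):
    """Convert a size such as '61.2K' or '4526B' to an approximate byte count."""
    return float(text[:-1]) * UNITS[text[-1]]


def parse_listing(listing):
    """Return one dict per image row, keyed by the header column names."""
    lines = [ln for ln in listing.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        # The dashed separator line has a single field and is skipped here.
        if len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


def total_bytes(rows):
    """Sum the approximate stored size of all listed images and masks."""
    return sum(parse_size(row["size"]) for row in rows)


rows = parse_listing(INPUT_LISTING)
for row in rows:
    print(row["num"], row["type"], row["width"], row["height"], row["size"])
print("approximate total bytes:", total_bytes(rows))
```

Note that image 1 and the mask report the same object ID and size, so a plain sum like this may count that stream twice.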

Example file See attached (input.pdf and output files named after the options used).

output-O1-deskew-clean-final.pdf output-O3-deskew-clean-final.pdf input.pdf

Expected behavior Either the quality or the file size stays about the same. Both getting worse at the same time is disheartening. ;)

System

  • OS: Ubuntu 20.04 in Windows 10 WSL
  • OCRmyPDF Version: 13.4.3
  • How did you install ocrmypdf? Did you use a system package manager, pip, or a Docker image? Installed with pip in a virtualenv, using the latest versions from pip for all dependencies:
$ pip3 list
Package             Version
------------------- --------
cffi                1.15.0
chardet             4.0.0
coloredlogs         15.0.1
cryptography        36.0.2
humanfriendly       10.0
img2pdf             0.4.4
importlib-resources 5.7.1
lxml                4.8.0
ocrmypdf            13.4.3
packaging           21.3
pdfminer.six        20220319
pikepdf             5.1.2
Pillow              9.1.0
pip                 20.0.2
pkg-resources       0.0.0
pluggy              1.0.0
pycparser           2.21
pyparsing           3.0.8
reportlab           3.6.9
setuptools          44.0.0
tqdm                4.64.0
wheel               0.34.2
zipp                3.8.0

$ ghostscript -v
GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc.  All rights reserved.

bllngr, Apr 21 '22 07:04

This is generally related to other issues caused by ocrmypdf not supporting mixed raster coding. You don't get the colorspace promotion mentioned in those issues, but you do get images that covered only a portion of the page being changed to cover all of it.

The mask is weird on this one - it probably should trigger DPI promotion for the underlying image, although I'm not sure if that makes sense, or why that mask shows up as 1-1-jpeg in pdfimages. That's not a valid combination....
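As a quick way to spot the odd combination described above in other files, a small Python sketch can scan `pdfimages -list` text for rows that report 1 bit per component together with `jpeg` encoding. The function name is hypothetical, and the column positions are assumed from the listing in this report (16 whitespace-separated fields per row).

```python
# Sample rows copied from the listings in this report.
SAMPLE = """\
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1240  1753  gray    1   8  jpeg   no         4  0   151   151 61.2K 2.9%
   1     2 mask     2481  3507  -       1   1  jpeg   no         5  0   301   301 4526B 0.4%
"""


def one_bit_jpeg_rows(listing):
    """Return (num, type) for rows with bpc == 1 and enc == 'jpeg'."""
    found = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and anything malformed.
        if len(fields) != 16 or not fields[0].isdigit():
            continue
        bpc, enc = fields[7], fields[8]
        if bpc == "1" and enc == "jpeg":
            found.append((int(fields[1]), fields[2]))
    return found


print(one_bit_jpeg_rows(SAMPLE))
```

This only flags what pdfimages reports; it does not tell whether the underlying stream is actually malformed or merely listed oddly.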

jbarlow83, Apr 26 '22 20:04

Is there a solution for this? This problem still exists in 14.0.4.

cyberthom42, Apr 26 '23 11:04