OCRmyPDF
Double to quadruple file size and worse quality with --deskew --clean-final (due to mask?)
Describe the bug
Running files from my Samsung M2070 through OCRmyPDF creates much bigger PDFs with worse quality than the original when unpaper and --clean-final are used.
To Reproduce
Each of these commands produces files that are much bigger and worse than the original:
$ ocrmypdf -l deu -O1 --deskew --clean-final ./input.pdf ./output-O1-deskew-clean-final.pdf
[...]
The output file size is 4.10× larger than the input file.
[...]
$ ocrmypdf -l deu -O3 --deskew --clean-final ./input.pdf ./output-O3-deskew-clean-final.pdf
[...]
The output file size is 2.51× larger than the input file.
[...]
pdfimages output:
$ pdfimages -list input.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1240 1753 gray 1 8 jpeg no 4 0 151 151 61.2K 2.9%
1 1 image 413 584 gray 1 8 jpeg no 5 0 50 50 4526B 1.9%
1 2 mask 2481 3507 - 1 1 jpeg no 5 0 301 301 4526B 0.4%
$ pdfimages -list output-O1-deskew-clean-final.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1240 1755 gray 1 8 jpeg no 15 0 151 151 538K 25%
$ pdfimages -list output-O3-deskew-clean-final.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1240 1755 gray 1 8 jpeg no 15 0 151 151 323K 15%
Note that the input file contains a mask. This may be a similar case to #261 and #269. As far as I can see, no colorspace promotion or transcoding happened, but I'm only guessing.
Example file
See attached (input correct.pdf and output files named with the used options).
output-O1-deskew-clean-final.pdf output-O3-deskew-clean-final.pdf input.pdf
Expected behavior
Either the quality or the file size stays about the same. Having both get worse at the same time is disheartening. ;)
System
- OS: Ubuntu 20.04 in Windows 10 WSL
- OCRmyPDF Version: 13.4.3
- How did you install ocrmypdf? Did you use a system package manager, pip, or a Docker image? Installed with pip in a virtualenv, using the latest versions from pip for all dependencies:
$ pip3 list
Package Version
------------------- --------
cffi 1.15.0
chardet 4.0.0
coloredlogs 15.0.1
cryptography 36.0.2
humanfriendly 10.0
img2pdf 0.4.4
importlib-resources 5.7.1
lxml 4.8.0
ocrmypdf 13.4.3
packaging 21.3
pdfminer.six 20220319
pikepdf 5.1.2
Pillow 9.1.0
pip 20.0.2
pkg-resources 0.0.0
pluggy 1.0.0
pycparser 2.21
pyparsing 3.0.8
reportlab 3.6.9
setuptools 44.0.0
tqdm 4.64.0
wheel 0.34.2
zipp 3.8.0
$ ghostscript -v
GPL Ghostscript 9.50 (2019-10-15)
Copyright (C) 2019 Artifex Software, Inc. All rights reserved.
This is generally related to other issues with ocrmypdf not having mixed raster coding. You don't get the colorspace promotion mentioned in those issues, but images that covered only a portion of the page do end up covering all of it.
The mask is weird on this one - it probably should trigger DPI promotion for the underlying image, although I'm not sure if that makes sense, or why that mask shows up as 1-1-jpeg in pdfimages. That's not a valid combination....
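For anyone who wants to check their own scans for the same odd mask, here is a minimal Python sketch that parses `pdfimages -list` output and flags rows reporting 1 bit per component together with `jpeg` encoding, the combination called out above as not valid. The names `ImageRow`, `parse_pdfimages_list`, and `suspicious_rows` are made up for this illustration and are not part of ocrmypdf or poppler; the sample text is the input listing from the report. It assumes the column order shown in that listing.

```python
from typing import List, NamedTuple


class ImageRow(NamedTuple):
    """The first nine columns of one `pdfimages -list` data row."""
    page: int
    num: int
    type: str
    width: int
    height: int
    color: str
    comp: int
    bpc: int
    enc: str


def parse_pdfimages_list(text: str) -> List[ImageRow]:
    """Parse data rows, skipping the header, the dashed rule, and anything else."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows begin with two integers (page and image number).
        if len(fields) < 9 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        rows.append(ImageRow(
            int(fields[0]), int(fields[1]), fields[2],
            int(fields[3]), int(fields[4]), fields[5],
            int(fields[6]), int(fields[7]), fields[8],
        ))
    return rows


def suspicious_rows(rows: List[ImageRow]) -> List[ImageRow]:
    """Return rows that claim 1 bit per component with JPEG encoding."""
    return [r for r in rows if r.bpc == 1 and r.enc == "jpeg"]


# Listing for input.pdf, copied from the report above.
SAMPLE = """\
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1240 1753 gray 1 8 jpeg no 4 0 151 151 61.2K 2.9%
1 1 image 413 584 gray 1 8 jpeg no 5 0 50 50 4526B 1.9%
1 2 mask 2481 3507 - 1 1 jpeg no 5 0 301 301 4526B 0.4%
"""

if __name__ == "__main__":
    for row in suspicious_rows(parse_pdfimages_list(SAMPLE)):
        print(f"page {row.page}, image {row.num}: {row.type} "
              f"{row.width}x{row.height} reports bpc={row.bpc} enc={row.enc}")
```

To run it against a real file, the text passed to `parse_pdfimages_list` could come from `subprocess.run(["pdfimages", "-list", path], capture_output=True, text=True).stdout`. This only detects the unusual row; it does not explain why pdfimages reports it that way.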
Is there a solution for this? This problem still exists in 14.0.4.