OCRmyPDF optimize.py doesn't process images with subtype Form

Open • imz opened this issue 2 years ago • 1 comment

Describe the bug

After rearranging the pages of a scanned document with pdfjam, the images in the resulting file cannot be optimized, because the objects holding them have an unexpected subtype (/Form).

To Reproduce

A complete example was described at https://github.com/qpdf/qpdf/issues/712.

The relevant files are attached:

scan.pdf pages_jam.pdf

Here I repeat the description of my example:

As for recompression, a script can be written on top of pikepdf (the Python module often mentioned here in the issues).

And actually a tool that uses pikepdf, ocrmypdf, already has a module for this purpose: optimize.py.

One can either set ocrmypdf's options so that it skips the OCR and only recompresses (see the Makefile in my example below), or run optimize.py directly, since it is almost ready to be used as a standalone script.

First, to run it as a standalone script, one has to copy it out of the package tree. Otherwise some imports pick up the wrong module from the package instead of the global one: subprocess.py is present in both places, and if you run the script from the package's tree, Python finds the package's copy first.
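The shadowing described above can be reproduced in isolation. The following is a minimal sketch, not part of ocrmypdf: the helper name resolve_as_script_in is made up for illustration, and the empty subprocess.py in a temporary directory merely stands in for the package's own module. It relies on the fact that Python puts the directory of the script being run at the front of the module search path.

```python
import sys
import tempfile
from importlib.machinery import PathFinder
from pathlib import Path


def resolve_as_script_in(script_dir, module_name):
    """Return the file a path-based import of module_name would load if a
    script were run from script_dir (hypothetical helper for illustration)."""
    # Running "python3 some_dir/script.py" puts some_dir first on sys.path,
    # so we model that by prepending script_dir to the current search path.
    search_path = [str(script_dir)] + sys.path
    spec = PathFinder.find_spec(module_name, search_path)
    return spec.origin if spec else None


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the package's own subprocess.py next to the script.
    (Path(tmp) / "subprocess.py").write_text("# local module\n")
    found = resolve_as_script_in(tmp, "subprocess")
    # True when the local file wins over the standard library module.
    shadowed = Path(found).parent.samefile(tmp)

print(shadowed)
```

Copying the script to a directory that contains no subprocess.py removes the local file from the front of the search path, which is why the standalone copy imports the standard library module again.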

Then I made small modifications to the script to make it more convenient:

--- OCRmyPDF/src/ocrmypdf/optimize.py	2022-06-08 06:21:51.659092559 +0300
+++ optimize.py	2022-06-08 07:15:52.993430611 +0300
@@ -46,7 +46,7 @@
 from ocrmypdf.exceptions import OutputFileAccessError
 from ocrmypdf.helpers import IMG2PDF_KWARGS, safe_symlink
 
-log = logging.getLogger(__name__)
+log = logging.getLogger("ocrmypdf")
 
 DEFAULT_JPEG_QUALITY = 75
 DEFAULT_PNG_QUALITY = 70
@@ -78,7 +78,10 @@
     del pike  # unused args
     del root
 
+    log.debug(f"start extract_image_filter xref {xref}")
+
     if image.Subtype != Name.Image:
+        log.debug(f"extract_image_filter xref {xref}: {image.Subtype} != Name.Image")
         return None
     if image.Length < 100:
         log.debug(f"xref {xref}: skipping image with small stream size")
@@ -278,6 +281,7 @@
 
     working_xrefs = include_xrefs - exclude_xrefs
     for xref in working_xrefs:
+        log.debug(f"about to extract xref {xref}")
         image = pike.get_object((xref, 0))
         try:
             result = extract_fn(
@@ -677,9 +681,12 @@
         safe_symlink(target_file, output_file)
 
 
-def main(infile, outfile, level, jobs=1):
+def main(infile, outfile, level, jpegq, pngq, jobs=1):
     from shutil import copy  # pylint: disable=import-outside-toplevel
     from tempfile import TemporaryDirectory  # pylint: disable=import-outside-toplevel
+    from ocrmypdf.api import Verbosity, configure_logging
+
+    configure_logging(Verbosity.debug)
 
     class OptimizeOptions:
         """Emulate ocrmypdf's options"""
@@ -702,8 +709,8 @@
         input_file=infile,
         jobs=jobs,
         optimize_=int(level),
-        jpeg_quality=0,  # Use default
-        png_quality=0,
+        jpeg_quality=int(jpegq),  # 0 for default
+        png_quality=int(pngq),
         jb2lossy=False,
     )
 
@@ -724,4 +731,4 @@
 
 
 if __name__ == '__main__':
-    main(sys.argv[1], sys.argv[2], sys.argv[3])
+    main(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4], sys.argv[5])
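
With five positional arguments read straight from sys.argv, forgetting one of them fails with an IndexError. As a hedged alternative, the same interface could be described with argparse. This is only a sketch: build_parser is a hypothetical helper that does not exist in optimize.py, and the assumption that the level lies in the range 0 to 3 is taken from the values ocrmypdf accepts for --optimize.

```python
import argparse


def build_parser():
    """Hypothetical command line for the standalone script (not in optimize.py)."""
    parser = argparse.ArgumentParser(
        description="Recompress the images in a PDF without running OCR"
    )
    parser.add_argument("infile")
    parser.add_argument("outfile")
    # Assumed range: the same levels that ocrmypdf's --optimize accepts.
    parser.add_argument("level", type=int, choices=range(0, 4))
    # 0 means "use the default quality", as in the patched main().
    parser.add_argument("jpegq", type=int, nargs="?", default=0)
    parser.add_argument("pngq", type=int, nargs="?", default=0)
    return parser


# Same arguments as the Makefile rule for %_optimize.pdf uses.
args = build_parser().parse_args(
    ["pages_jam.pdf", "pages_jam_optimize.pdf", "2", "5", "0"]
)
print(args.level, args.jpegq, args.pngq)
```

The parsed values could then be passed on as main(args.infile, args.outfile, args.level, args.jpegq, args.pngq).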

And here is my example, in the form of a Makefile, that tests how it works:

ALL := $(foreach join,jam qpdf,\
	$(foreach recompr,ocrmypdf optimize,\
		pages_$(join)_$(recompr).pdf))

all: $(ALL)

clean::
	rm -fv $(ALL)

# Two methods to recompress:

%_ocrmypdf.pdf: %.pdf
	ocrmypdf --output-type pdf \
		--verbose 2 \
		--skip-big 1 \
		--optimize 3 --jpeg-quality 5 \
		$< -- \
		$@

%_optimize.pdf: %.pdf
	python3 optimize.py $< $@ 2 5 0

# Two methods to create a PDF by joining pages:

pages_jam.pdf: scan.pdf
	pdfjam --fitpaper true \
		$^ $^ \
		--landscape --angle 270 \
		--outfile $@

clean::
	rm -fv pages_jam.pdf

pages_qpdf.pdf: scan.pdf
	qpdf --empty \
		--pages $^ $^ -- \
		--rotate=90 \
		$@

clean::
	rm -fv pages_qpdf.pdf

The original scan.pdf will be attached so that you can re-run my example.

scan.pdf

The results:

$ du -shL scan.pdf pages*.pdf
112K	scan.pdf
212K	pages_jam_ocrmypdf.pdf
212K	pages_jam_optimize.pdf
212K	pages_jam.pdf
28K	pages_qpdf_ocrmypdf.pdf
24K	pages_qpdf_optimize.pdf
108K	pages_qpdf.pdf

If the processed file was created by qpdf, then recompression works.

If it was created by pdfjam, then it does not. This is because the objects containing the images have a different (unexpected) subtype in the PDF:

$ make pages_jam_optimize.pdf
python3 optimize.py pages_jam.pdf pages_jam_optimize.pdf 2 5 0
xref 7: treating as an optimization candidate
xref 82: treating as an optimization candidate
about to extract xref 82
start extract_image_filter xref 82
extract_image_filter xref 82: /Form != Name.Image
about to extract xref 7
start extract_image_filter xref 7
extract_image_filter xref 7: /Form != Name.Image
Optimizable images: JPEGs: 0 PNGs: 0
xref 7: treating as an optimization candidate
xref 82: treating as an optimization candidate
about to extract xref 82
start extract_image_filter xref 82
extract_image_filter xref 82: /Form != Name.Image
about to extract xref 7
start extract_image_filter xref 7
extract_image_filter xref 7: /Form != Name.Image
xref 7: treating as an optimization candidate
xref 82: treating as an optimization candidate
about to extract xref 82
start extract_image_filter xref 82
extract_image_filter xref 82: /Form != Name.Image
about to extract xref 7
start extract_image_filter xref 7
extract_image_filter xref 7: /Form != Name.Image
Optimizable images: JBIG2 groups: 0
Optimize ratio: 1.01 savings: 0.5%
os.symlink(/tmp/tmpsidiaa4n/out.opt.pdf, /tmp/tmpsidiaa4n/out.pdf)

Note the debugging message:

extract_image_filter xref 82: /Form != Name.Image

that explains the failure to find and process those images.

The same problem at the recompression stage also arises if the main tool, ocrmypdf, is run on this input file pages_jam.pdf (the way it is done in the Makefile, for instance).

Jun 08 '22 09:06 imz

A /Form is a "Form XObject" or a group object that can contain one or more images among other object types, including recursively other forms. A /Form is not an image. A form might have /Resources, and some of those resources might be images. If ocrmypdf does not check the images attached to forms then that is indeed a missed optimization opportunity. Forms themselves cannot be optimized.

I believe the underlying issue is that enumerating pdf.images through pikepdf/qpdf does not find images that are contained in Form XObjects.
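The nesting described in this comment can be illustrated with a small model. This is a sketch under stated assumptions, not ocrmypdf's or pikepdf's code: plain Python dicts stand in for PDF dictionaries, the object names /Fm0 and /Im0 are made up, and iter_images is a hypothetical helper. It shows why a scan that only looks at the top-level XObjects of a page finds nothing, while a walk that descends into the /Resources of each form does.

```python
def iter_images(resources, _seen=None):
    """Yield (name, xobject) for every image reachable from a resources
    dictionary, descending recursively into Form XObjects."""
    seen = set() if _seen is None else _seen
    xobjects = (resources or {}).get("/XObject", {})
    for name, xobj in xobjects.items():
        if id(xobj) in seen:  # guard against forms that refer to each other
            continue
        seen.add(id(xobj))
        subtype = xobj.get("/Subtype")
        if subtype == "/Image":
            yield name, xobj
        elif subtype == "/Form":
            # A form is not an image, but its own resources may hold images.
            yield from iter_images(xobj.get("/Resources"), seen)


# Model of a page whose only top-level XObject is a form wrapping the scan.
page_resources = {
    "/XObject": {
        "/Fm0": {
            "/Subtype": "/Form",
            "/Resources": {
                "/XObject": {"/Im0": {"/Subtype": "/Image", "/Length": 50000}}
            },
        }
    }
}

top_level_only = [
    name
    for name, xobj in page_resources["/XObject"].items()
    if xobj["/Subtype"] == "/Image"
]
nested = [name for name, _ in iter_images(page_resources)]
print(top_level_only, nested)
```

In this model the top-level scan yields no images, which mirrors the "/Form != Name.Image" messages in the log above, while the recursive walk reaches /Im0.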

Jun 09 '22 08:06 jbarlow83

Is there a fix for optimizing images inside Form XObjects?

Feb 21 '23 13:02 benbro

Upstream issue is fixed. Will it just be supported or should we change something?

May 22 '23 03:05 benbro

Fixed in v14.2.1 (independent of pikepdf version)

May 23 '23 21:05 jbarlow83
