[Bug]: ocrmypdf fails for a specific pdf file
Describe the bug
When running ocrmypdf on a specific PDF file, it raises an exception at 0% of the "Recompressing JPEGs" step.
An exception occurred while executing the pipeline
[Traceback, posted below ...]
OSError: image file is truncated (1 bytes not processed)
Steps to reproduce
1. Run `ocrmypdf input.pdf output.pdf`
2. Notice that it fails at the "Recompressing JPEGs" stage
Files
The file is found here: https://annas-archive.org/md5/aee9796ac090fdc8a93fc654f32020f3
How did you download and install the software?
Linux package manager (apt, dnf, etc.)
OCRmyPDF version
16.12.0
Relevant log output
[...]
Optimizable images: JPEGs: 774 PNGs: 0 optimize.py:371
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/774 -:--:--
An exception occurred while executing the pipeline _common.py:296
Traceback (most recent call last):
File "/usr/lib/python3.13/site-packages/ocrmypdf/_pipelines/_common.py", line 261, in cli_exception_handler
return fn(options, plugin_manager)
File "/usr/lib/python3.13/site-packages/ocrmypdf/_pipelines/ocr.py", line 181, in _run_pipeline
optimize_messages = exec_concurrent(context, executor)
File "/usr/lib/python3.13/site-packages/ocrmypdf/_pipelines/ocr.py", line 145, in exec_concurrent
pdf, messages = postprocess(pdf, context, executor)
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/ocrmypdf/_pipelines/_common.py", line 460, in postprocess
return optimize_pdf(pdf_out, context, executor)
File "/usr/lib/python3.13/site-packages/ocrmypdf/_pipeline.py", line 992, in optimize_pdf
output_pdf, messages = context.plugin_manager.hook.optimize_pdf(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
input_pdf=input_file,
^^^^^^^^^^^^^^^^^^^^^
...<3 lines>...
linearize=should_linearize(input_file, context),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/lib/python3.13/site-packages/pluggy/_hooks.py", line 512, in __call__
return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/pluggy/_manager.py", line 120, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/pluggy/_callers.py", line 167, in _multicall
raise exception
File "/usr/lib/python3.13/site-packages/pluggy/_callers.py", line 121, in _multicall
res = hook_impl.function(*args)
File "/usr/lib/python3.13/site-packages/ocrmypdf/builtin_plugins/optimize.py", line 145, in optimize_pdf
result_path = optimize(input_pdf, output_pdf, context, save_settings, executor)
File "/usr/lib/python3.13/site-packages/ocrmypdf/optimize.py", line 727, in optimize
transcode_jpegs(pdf, jpegs, root, options, executor)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/ocrmypdf/optimize.py", line 512, in transcode_jpegs
executor(
~~~~~~~~^
use_threads=True, # Processes are significantly slower at this task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<9 lines>...
task_finished=finish_jpeg,
^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/lib/python3.13/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
self._execute(
~~~~~~~~~~~~~^
use_threads=use_threads,
^^^^^^^^^^^^^^^^^^^^^^^^
...<5 lines>...
task_finished=task_finished,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/usr/lib/python3.13/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 162, in _execute
result = future.result()
File "/usr/lib/python3.13/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
~~~~~~~~~~~~~~~~~^^
File "/usr/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib/python3.13/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/lib/python3.13/site-packages/ocrmypdf/optimize.py", line 484, in _optimize_jpeg
im.save(opt_jpg, optimize=True, quality=jpeg_quality)
~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/PIL/Image.py", line 2539, in save
self.load()
~~~~~~~~~^^
File "/usr/lib/python3.13/site-packages/PIL/ImageFile.py", line 391, in load
raise OSError(msg)
OSError: image file is truncated (1 bytes not processed)
We also have this issue with our scanning pipeline after the upgrade to 16.12.0.
We've processed thousands of documents so far without ever seeing this failure, and today our pipeline choked with `OSError: image file is truncated` on two of the 40 files our EPSON document scanner produced.
So, this very much looks like a regression to us.
Also seeing this on macOS today with the latest release. Here is the output.
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 240/240 0:00:00
Start processing 12 pages concurrently ocr.py:96
9 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
2 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
5 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
4 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
8 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
3 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
6 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
7 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
83 [tesseract] Image too small to scale!! (2x36 vs min width of 3) tesseract.py:269
83 [tesseract] Line cannot be recognized!! tesseract.py:269
83 [tesseract] Image too small to scale!! (2x36 vs min width of 3) tesseract.py:269
83 [tesseract] Line cannot be recognized!! tesseract.py:269
106 [tesseract] Image too small to scale!! (2x36 vs min width of 3) tesseract.py:269
106 [tesseract] Line cannot be recognized!! tesseract.py:269
130 [tesseract] Image too small to scale!! (2x36 vs min width of 3) tesseract.py:269
130 [tesseract] Line cannot be recognized!! tesseract.py:269
130 [tesseract] Image too small to scale!! (2x36 vs min width of 3) tesseract.py:269
130 [tesseract] Line cannot be recognized!! tesseract.py:269
134 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
167 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
165 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
172 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
214 [tesseract] lots of diacritics - possibly poor OCR tesseract.py:251
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 240/240 0:00:00
Postprocessing... ocr.py:144
PDF/A conversion ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 240/240 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata. _metadata.py:63
Linearizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/240 -:--:--
An exception occurred while executing the pipeline _common.py:296
Traceback (most recent call last):
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_pipelines/_common.py", line 261, in cli_exception_handler
return fn(options, plugin_manager)
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_pipelines/ocr.py", line 181, in _run_pipeline
optimize_messages = exec_concurrent(context, executor)
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_pipelines/ocr.py", line 145, in exec_concurrent
pdf, messages = postprocess(pdf, context, executor)
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_pipelines/_common.py", line 460, in postprocess
return optimize_pdf(pdf_out, context, executor)
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_pipeline.py", line 992, in optimize_pdf
output_pdf, messages = context.plugin_manager.hook.optimize_pdf(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
input_pdf=input_file,
^^^^^^^^^^^^^^^^^^^^^
...<3 lines>...
linearize=should_linearize(input_file, context),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/pluggy/_hooks.py", line 512, in __call__
return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec
return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall
raise exception
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall
res = hook_impl.function(*args)
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/builtin_plugins/optimize.py", line 145, in optimize_pdf
result_path = optimize(input_pdf, output_pdf, context, save_settings, executor)
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/optimize.py", line 727, in optimize
transcode_jpegs(pdf, jpegs, root, options, executor)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/optimize.py", line 512, in transcode_jpegs
executor(
~~~~~~~~^
use_threads=True, # Processes are significantly slower at this task
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...<9 lines>...
task_finished=finish_jpeg,
^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
self._execute(
~~~~~~~~~~~~~^
use_threads=use_threads,
^^^^^^^^^^^^^^^^^^^^^^^^
...<5 lines>...
task_finished=task_finished,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 162, in _execute
result = future.result()
File "/opt/homebrew/Cellar/python@3.14/3.14.2/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/_base.py", line 443, in result
return self.__get_result()
~~~~~~~~~~~~~~~~~^^
File "/opt/homebrew/Cellar/python@3.14/3.14.2/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/_base.py", line 395, in __get_result
raise self._exception
File "/opt/homebrew/Cellar/python@3.14/3.14.2/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/thread.py", line 86, in run
result = ctx.run(self.task)
File "/opt/homebrew/Cellar/python@3.14/3.14.2/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/thread.py", line 73, in run
return fn(*args, **kwargs)
File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/optimize.py", line 484, in _optimize_jpeg
im.save(opt_jpg, optimize=True, quality=jpeg_quality)
~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/opt/pillow/lib/python3.14/site-packages/PIL/Image.py", line 2539, in save
self.load()
~~~~~~~~~^^
File "/opt/homebrew/opt/pillow/lib/python3.14/site-packages/PIL/ImageFile.py", line 391, in load
raise OSError(msg)
OSError: image file is truncated (6 bytes not processed)
I determined that the issue is in Ghostscript 10.6 and reported it here: https://bugs.ghostscript.com/show_bug.cgi?id=708961
For now, you can work around the issue by downgrading to an earlier Ghostscript version or by running `ocrmypdf --output-type pdf`, which does not use Ghostscript to create a PDF/A and thereby avoids the issue.
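For anyone running this in an automated scanning pipeline, here is a minimal sketch of how the `--output-type pdf` workaround could be wired in as a retry. The helper names `ocr_command` and `ocr_with_fallback` and the injectable `run` parameter are hypothetical and not part of OCRmyPDF; only the `--output-type pdf` flag comes from this thread. Note that it retries on any failure, not just this one, and that the fallback output is a plain PDF rather than PDF/A.

```python
import subprocess
from typing import Callable, List


def ocr_command(src: str, dst: str, plain_pdf: bool = False) -> List[str]:
    """Build an ocrmypdf command line (hypothetical helper)."""
    cmd = ["ocrmypdf"]
    if plain_pdf:
        # Workaround from this thread: skip the Ghostscript PDF/A step.
        cmd += ["--output-type", "pdf"]
    return cmd + [src, dst]


def ocr_with_fallback(
    src: str, dst: str, run: Callable = subprocess.run
) -> List[str]:
    """Try the default invocation; on failure, retry once with the workaround.

    Returns the command that succeeded. `run` defaults to subprocess.run and
    is injectable so the logic can be exercised without ocrmypdf installed.
    """
    default = ocr_command(src, dst)
    if run(default).returncode == 0:
        return default

    fallback = ocr_command(src, dst, plain_pdf=True)
    if run(fallback).returncode != 0:
        raise RuntimeError("ocrmypdf failed even with --output-type pdf")
    return fallback


if __name__ == "__main__":
    import sys

    # Usage: python ocr_fallback.py input.pdf output.pdf
    if len(sys.argv) == 3:
        print("succeeded with:", ocr_with_fallback(sys.argv[1], sys.argv[2]))
```

If PDF/A output is a hard requirement for you, downgrading Ghostscript is the better option and this retry should be left out.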
Thanks for the quick response. Using --output-type pdf was good enough for my use case. Appreciate it.
Fixed in 16.13.0 (hopefully) with detection and repair of images corrupted by Ghostscript.
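As a rough illustration of what "detecting" a truncated JPEG can mean, here is a minimal sketch that checks whether a byte stream starts with the JPEG start-of-image marker (`FF D8`) and ends with the end-of-image marker (`FF D9`). This is an assumption for illustration only: the thread does not describe the check OCRmyPDF 16.13.0 performs, and `looks_complete_jpeg` is a hypothetical name. A stream can pass this test and still be corrupt in other ways.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def looks_complete_jpeg(data: bytes) -> bool:
    """Heuristic only: True if data begins with SOI and ends with EOI.

    This does not decode the image, so it catches streams that were cut
    short but not damage in the middle of the stream.
    """
    return len(data) >= 4 and data.startswith(SOI) and data.endswith(EOI)


if __name__ == "__main__":
    import sys

    # Usage: python check_jpeg.py image1.jpg image2.jpg ...
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            ok = looks_complete_jpeg(f.read())
        print(f"{path}: {'ok' if ok else 'possibly truncated'}")
```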