[Bug]: ocrmypdf fails for a specific pdf file

Open dryBoneMarrow opened this issue 1 month ago • 4 comments

Describe the bug

When running ocrmypdf on a specific PDF file, it raises an exception at 0% of the "Recompressing JPEGs" step.

An exception occurred while executing the pipeline
[Traceback, posted below ...]
OSError: image file is truncated (1 bytes not processed)

Steps to reproduce

1. Run `ocrmypdf input.pdf output.pdf`
2. Notice that it fails at the "Recompressing JPEGs" stage

Files

The file is found here: https://annas-archive.org/md5/aee9796ac090fdc8a93fc654f32020f3

How did you download and install the software?

Linux package manager (apt, dnf, etc.)

OCRmyPDF version

16.12.0

Relevant log output

[...]
Optimizable images: JPEGs: 774 PNGs: 0                                                     optimize.py:371
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0%   0/774 -:--:--
An exception occurred while executing the pipeline                                          _common.py:296
Traceback (most recent call last):                                                                        
  File "/usr/lib/python3.13/site-packages/ocrmypdf/_pipelines/_common.py", line 261, in                   
cli_exception_handler                                                                                     
    return fn(options, plugin_manager)                                                                    
  File "/usr/lib/python3.13/site-packages/ocrmypdf/_pipelines/ocr.py", line 181, in                       
_run_pipeline                                                                                             
    optimize_messages = exec_concurrent(context, executor)                                                
  File "/usr/lib/python3.13/site-packages/ocrmypdf/_pipelines/ocr.py", line 145, in                       
exec_concurrent                                                                                           
    pdf, messages = postprocess(pdf, context, executor)                                                   
                    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^                                                   
  File "/usr/lib/python3.13/site-packages/ocrmypdf/_pipelines/_common.py", line 460, in                   
postprocess                                                                                               
    return optimize_pdf(pdf_out, context, executor)                                                       
  File "/usr/lib/python3.13/site-packages/ocrmypdf/_pipeline.py", line 992, in optimize_pdf               
    output_pdf, messages = context.plugin_manager.hook.optimize_pdf(                                      
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^                                      
        input_pdf=input_file,                                                                             
        ^^^^^^^^^^^^^^^^^^^^^                                                                             
    ...<3 lines>...                                                                                       
        linearize=should_linearize(input_file, context),                                                  
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                  
    )                                                                                                     
    ^                                                                                                     
  File "/usr/lib/python3.13/site-packages/pluggy/_hooks.py", line 512, in __call__                        
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)                         
           ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                         
  File "/usr/lib/python3.13/site-packages/pluggy/_manager.py", line 120, in _hookexec                     
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)                                  
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                  
  File "/usr/lib/python3.13/site-packages/pluggy/_callers.py", line 167, in _multicall                    
    raise exception                                                                                       
  File "/usr/lib/python3.13/site-packages/pluggy/_callers.py", line 121, in _multicall                    
    res = hook_impl.function(*args)                                                                       
  File "/usr/lib/python3.13/site-packages/ocrmypdf/builtin_plugins/optimize.py", line 145,                
in optimize_pdf                                                                                           
    result_path = optimize(input_pdf, output_pdf, context, save_settings, executor)                       
  File "/usr/lib/python3.13/site-packages/ocrmypdf/optimize.py", line 727, in optimize                    
    transcode_jpegs(pdf, jpegs, root, options, executor)                                                  
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                  
  File "/usr/lib/python3.13/site-packages/ocrmypdf/optimize.py", line 512, in                             
transcode_jpegs                                                                                           
    executor(                                                                                             
    ~~~~~~~~^                                                                                             
        use_threads=True,  # Processes are significantly slower at this task                              
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                              
    ...<9 lines>...                                                                                       
        task_finished=finish_jpeg,                                                                        
        ^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                        
    )                                                                                                     
    ^                                                                                                     
  File "/usr/lib/python3.13/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__                  
    self._execute(                                                                                        
    ~~~~~~~~~~~~~^                                                                                        
        use_threads=use_threads,                                                                          
        ^^^^^^^^^^^^^^^^^^^^^^^^                                                                          
    ...<5 lines>...                                                                                       
        task_finished=task_finished,                                                                      
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                      
    )                                                                                                     
    ^                                                                                                     
  File "/usr/lib/python3.13/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line                  
162, in _execute                                                                                          
    result = future.result()                                                                              
  File "/usr/lib/python3.13/concurrent/futures/_base.py", line 449, in result                             
    return self.__get_result()                                                                            
           ~~~~~~~~~~~~~~~~~^^                                                                            
  File "/usr/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result                       
    raise self._exception                                                                                 
  File "/usr/lib/python3.13/concurrent/futures/thread.py", line 59, in run                                
    result = self.fn(*self.args, **self.kwargs)                                                           
  File "/usr/lib/python3.13/site-packages/ocrmypdf/optimize.py", line 484, in                             
_optimize_jpeg                                                                                            
    im.save(opt_jpg, optimize=True, quality=jpeg_quality)                                                 
    ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                 
  File "/usr/lib/python3.13/site-packages/PIL/Image.py", line 2539, in save                               
    self.load()                                                                                           
    ~~~~~~~~~^^                                                                                           
  File "/usr/lib/python3.13/site-packages/PIL/ImageFile.py", line 391, in load                            
    raise OSError(msg)                                                                                    
OSError: image file is truncated (1 bytes not processed)

Dec 06 '25 15:12 dryBoneMarrow

We also have this issue with our scanning pipeline after the upgrade to 16.12.0. We had processed thousands of documents without ever seeing this failure, and today our pipeline choked with `OSError: image file is truncated` on two of the 40 files our EPSON document scanner produced. So this very much looks like a regression to us.

Dec 08 '25 12:12 sirthias

Also seeing this on macOS today with the latest release. Here is the output.

Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 240/240 0:00:00
Start processing 12 pages concurrently                                                                                                                                                                                                                                                                                ocr.py:96
    9 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
    2 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
    5 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
    4 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
    8 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
    3 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
    6 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
    7 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
   83 [tesseract] Image too small to scale!! (2x36 vs min width of 3)                                                                                                                                                                                                                                          tesseract.py:269
   83 [tesseract] Line cannot be recognized!!                                                                                                                                                                                                                                                                  tesseract.py:269
   83 [tesseract] Image too small to scale!! (2x36 vs min width of 3)                                                                                                                                                                                                                                          tesseract.py:269
   83 [tesseract] Line cannot be recognized!!                                                                                                                                                                                                                                                                  tesseract.py:269
  106 [tesseract] Image too small to scale!! (2x36 vs min width of 3)                                                                                                                                                                                                                                          tesseract.py:269
  106 [tesseract] Line cannot be recognized!!                                                                                                                                                                                                                                                                  tesseract.py:269
  130 [tesseract] Image too small to scale!! (2x36 vs min width of 3)                                                                                                                                                                                                                                          tesseract.py:269
  130 [tesseract] Line cannot be recognized!!                                                                                                                                                                                                                                                                  tesseract.py:269
  130 [tesseract] Image too small to scale!! (2x36 vs min width of 3)                                                                                                                                                                                                                                          tesseract.py:269
  130 [tesseract] Line cannot be recognized!!                                                                                                                                                                                                                                                                  tesseract.py:269
  134 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
  167 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
  165 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
  172 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
  214 [tesseract] lots of diacritics - possibly poor OCR                                                                                                                                                                                                                                                       tesseract.py:251
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 240/240 0:00:00
Postprocessing...                                                                                                                                                                                                                                                                                                    ocr.py:144
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 240/240 0:00:00
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                                                                                                                            _metadata.py:63
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0%   0/240 -:--:--
An exception occurred while executing the pipeline                                                                                                                                                                                                                                                               _common.py:296
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_pipelines/_common.py", line 261, in cli_exception_handler
    return fn(options, plugin_manager)
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_pipelines/ocr.py", line 181, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_pipelines/ocr.py", line 145, in exec_concurrent
    pdf, messages = postprocess(pdf, context, executor)
                    ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_pipelines/_common.py", line 460, in postprocess
    return optimize_pdf(pdf_out, context, executor)
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_pipeline.py", line 992, in optimize_pdf
    output_pdf, messages = context.plugin_manager.hook.optimize_pdf(
                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        input_pdf=input_file,
        ^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
        linearize=should_linearize(input_file, context),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/pluggy/_hooks.py", line 512, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall
    raise exception
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall
    res = hook_impl.function(*args)
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/builtin_plugins/optimize.py", line 145, in optimize_pdf
    result_path = optimize(input_pdf, output_pdf, context, save_settings, executor)
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/optimize.py", line 727, in optimize
    transcode_jpegs(pdf, jpegs, root, options, executor)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/optimize.py", line 512, in transcode_jpegs
    executor(
    ~~~~~~~~^
        use_threads=True,  # Processes are significantly slower at this task
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<9 lines>...
        task_finished=finish_jpeg,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/_concurrent.py", line 78, in __call__
    self._execute(
    ~~~~~~~~~~~~~^
        use_threads=use_threads,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
        task_finished=task_finished,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/builtin_plugins/concurrency.py", line 162, in _execute
    result = future.result()
  File "/opt/homebrew/Cellar/python@3.14/3.14.2/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/_base.py", line 443, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/opt/homebrew/Cellar/python@3.14/3.14.2/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/_base.py", line 395, in __get_result
    raise self._exception
  File "/opt/homebrew/Cellar/python@3.14/3.14.2/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/thread.py", line 86, in run
    result = ctx.run(self.task)
  File "/opt/homebrew/Cellar/python@3.14/3.14.2/Frameworks/Python.framework/Versions/3.14/lib/python3.14/concurrent/futures/thread.py", line 73, in run
    return fn(*args, **kwargs)
  File "/opt/homebrew/Cellar/ocrmypdf/16.12.0/libexec/lib/python3.14/site-packages/ocrmypdf/optimize.py", line 484, in _optimize_jpeg
    im.save(opt_jpg, optimize=True, quality=jpeg_quality)
    ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/opt/pillow/lib/python3.14/site-packages/PIL/Image.py", line 2539, in save
    self.load()
    ~~~~~~~~~^^
  File "/opt/homebrew/opt/pillow/lib/python3.14/site-packages/PIL/ImageFile.py", line 391, in load
    raise OSError(msg)
OSError: image file is truncated (6 bytes not processed)

Dec 08 '25 21:12 nmann4

I determined that the issue is in Ghostscript 10.6 and reported it here: https://bugs.ghostscript.com/show_bug.cgi?id=708961

For now, you can work around the issue by downgrading to an earlier Ghostscript or by using `ocrmypdf --output-type pdf`, which does not use Ghostscript to create a PDF/A, thereby avoiding the issue.
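
For anyone scripting this, here is a rough sketch of applying the workaround only when the installed Ghostscript looks like the affected release. The helper names are made up for illustration. It assumes that the Ghostscript executable is called `gs`, and that only the 10.6 series is affected, which this thread does not confirm either way.

```python
import re
import subprocess


def parse_ghostscript_version(text: str) -> tuple[int, ...]:
    """Turn a version string such as '10.06.0' into (10, 6, 0)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def ghostscript_looks_affected(version: tuple[int, ...]) -> bool:
    # Assumption: only the 10.6 series named in this thread is affected.
    return version[:2] == (10, 6)


def build_ocrmypdf_command(
    input_pdf: str, output_pdf: str, avoid_pdfa: bool
) -> list[str]:
    cmd = ["ocrmypdf"]
    if avoid_pdfa:
        # Workaround from this thread: skip the Ghostscript PDF/A conversion.
        cmd += ["--output-type", "pdf"]
    return cmd + [input_pdf, output_pdf]


def run_with_workaround(input_pdf: str, output_pdf: str) -> None:
    # Assumption: the Ghostscript executable is available on PATH as "gs".
    gs = subprocess.run(
        ["gs", "--version"], capture_output=True, text=True, check=True
    )
    affected = ghostscript_looks_affected(parse_ghostscript_version(gs.stdout))
    subprocess.run(
        build_ocrmypdf_command(input_pdf, output_pdf, affected), check=True
    )
```

Calling `run_with_workaround("input.pdf", "output.pdf")` would then produce a plain PDF instead of a PDF/A on affected systems and leave the default behaviour alone everywhere else.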

Dec 08 '25 22:12 jbarlow83

Thanks for the quick response. Using --output-type pdf was good enough for my use case. Appreciate it.

Dec 08 '25 23:12 nmann4

Fixed in 16.13.0 (hopefully) with detection and repair of images corrupted by Ghostscript.
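
To give a rough idea of what "detection" can mean for a truncated JPEG, here is a crude standalone heuristic. It is not the code that went into 16.13.0, and `jpeg_looks_damaged` is a hypothetical name. It only checks that the byte stream starts with the JPEG start-of-image marker (`FF D8`) and ends with the end-of-image marker (`FF D9`), so it will miss damage in the middle of the stream and will flag valid files that carry trailing bytes after the end marker.

```python
SOI = b"\xff\xd8"  # JPEG start-of-image marker
EOI = b"\xff\xd9"  # JPEG end-of-image marker


def jpeg_looks_damaged(data: bytes) -> bool:
    """Return True if the stream lacks the expected JPEG start or end marker."""
    if not data.startswith(SOI):
        # Not recognisable as a JPEG stream at all.
        return True
    # A stream cut off before its final bytes will not end with EOI.
    return not data.endswith(EOI)
```

A stream that fails this check could be skipped during optimization rather than handed to the recompression step.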

Dec 24 '25 07:12 jbarlow83