[3rdparty]: Paperless-ngx fails on consuming a file

Open GooRoo opened this issue 9 months ago • 6 comments

Simple sanity checks

  • [x] This is an issue with an app that uses OCRmyPDF for OCR
  • [x] I am using a recent version of the third party app
  • [x] I will include a file that reproduces the issue

Third party app name and version

Paperless-ngx 2.14.7

Describe the bug

Paperless can't consume a file.

Steps to reproduce

1. Import attached file into Paperless-ngx.
2. OCR is automatically triggered.
3. The process fails with the following errors in the log.

Files

o451229v21_160992A98S_202401.pdf

OCRmyPDF version

No response

Relevant log output

[2025-03-17 23:01:37,509] [ERROR] [paperless.consumer] Error occurred while consuming document o451229v21_160992A98S_202401.pdf: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_exec/ghostscript.py", line 288, in generate_pdfa
    p = run_polling_stderr(
        ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/subprocess/__init__.py", line 114, in run_polling_stderr
    raise CalledProcessError(proc.returncode, args, output=None, stderr=stderr)
subprocess.CalledProcessError: Command '['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '/tmp/ocrmypdf.io.p49cqgey/pdfa.pdf', '-sstdout=%stderr', '/tmp/ocrmypdf.io.p49cqgey/pdfa.ps', '/tmp/ocrmypdf.io.p49cqgey/fix_docinfo.pdf']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 382, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 214, in run_pipeline
    return _run_pipeline(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 181, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 145, in exec_concurrent
    pdf, messages = postprocess(pdf, context, executor)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 453, in postprocess
    pdf_out = convert_to_pdfa(pdf_out, ps_stub_out, context)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 912, in convert_to_pdfa
    context.plugin_manager.hook.generate_pdfa(
  File "/usr/local/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 131, in generate_pdfa
    ghostscript.generate_pdfa(
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_exec/ghostscript.py", line 301, in generate_pdfa
    raise SubprocessOutputError('Ghostscript PDF/A rendering failed') from e
ocrmypdf.exceptions.SubprocessOutputError: Ghostscript PDF/A rendering failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap
    raise exc_info[1]
  File "/usr/src/paperless/src/documents/consumer.py", line 477, in run
    document_parser.parse(self.working_copy, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 405, in parse
    raise ParseError(
documents.parsers.ParseError: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.
[2025-03-17 23:01:37,560] [ERROR] [paperless.tasks] ConsumeTaskPlugin failed: o451229v21_160992A98S_202401.pdf: Error occurred while consuming document o451229v21_160992A98S_202401.pdf: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_exec/ghostscript.py", line 288, in generate_pdfa
    p = run_polling_stderr(
        ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/subprocess/__init__.py", line 114, in run_polling_stderr
    raise CalledProcessError(proc.returncode, args, output=None, stderr=stderr)
subprocess.CalledProcessError: Command '['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '/tmp/ocrmypdf.io.p49cqgey/pdfa.pdf', '-sstdout=%stderr', '/tmp/ocrmypdf.io.p49cqgey/pdfa.ps', '/tmp/ocrmypdf.io.p49cqgey/fix_docinfo.pdf']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 382, in parse
    ocrmypdf.ocr(**args)
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/api.py", line 380, in ocr
    return run_pipeline(options=options, plugin_manager=plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 214, in run_pipeline
    return _run_pipeline(options, plugin_manager)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 181, in _run_pipeline
    optimize_messages = exec_concurrent(context, executor)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py", line 145, in exec_concurrent
    pdf, messages = postprocess(pdf, context, executor)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py", line 453, in postprocess
    pdf_out = convert_to_pdfa(pdf_out, ps_stub_out, context)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py", line 912, in convert_to_pdfa
    context.plugin_manager.hook.generate_pdfa(
  File "/usr/local/lib/python3.12/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/usr/local/lib/python3.12/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/builtin_plugins/ghostscript.py", line 131, in generate_pdfa
    ghostscript.generate_pdfa(
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_exec/ghostscript.py", line 301, in generate_pdfa
    raise SubprocessOutputError('Ghostscript PDF/A rendering failed') from e
ocrmypdf.exceptions.SubprocessOutputError: Ghostscript PDF/A rendering failed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/asgiref/sync.py", line 327, in main_wrap
    raise exc_info[1]
  File "/usr/src/paperless/src/documents/consumer.py", line 477, in run
    document_parser.parse(self.working_copy, mime_type, self.filename)
  File "/usr/src/paperless/src/paperless_tesseract/parsers.py", line 405, in parse
    raise ParseError(
documents.parsers.ParseError: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/src/paperless/src/documents/tasks.py", line 154, in consume_file
    msg = plugin.run()
          ^^^^^^^^^^^^
  File "/usr/src/paperless/src/documents/consumer.py", line 509, in run
    self._fail(
  File "/usr/src/paperless/src/documents/consumer.py", line 151, in _fail
    raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
documents.consumer.ConsumerError: o451229v21_160992A98S_202401.pdf: Error occurred while consuming document o451229v21_160992A98S_202401.pdf: SubprocessOutputError: Ghostscript PDF/A rendering failed. See logs for more information.
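
The log only records that `gs` exited with status 1; Ghostscript's own stderr is not shown. One way to get at it is to recover the argument list from the `CalledProcessError` line and re-run it by hand. Below is a minimal sketch using only the standard library. The helper name `extract_command` is made up for illustration, and the sample line is a shortened stand-in with placeholder paths: the real `/tmp/ocrmypdf.io.*` files are temporary and would likely need to be replaced with files you produce yourself.

```python
import ast
import re
import shlex


def extract_command(log_line: str) -> list[str]:
    """Return the argv list embedded in a CalledProcessError message.

    Hypothetical helper, not part of OCRmyPDF or Paperless-ngx.
    """
    # The message has the form: Command '[...]' returned non-zero exit status N.
    match = re.search(r"Command '(\[.*\])'", log_line)
    if match is None:
        raise ValueError("no command list found in log line")
    # The bracketed part is the repr() of a Python list of strings,
    # so literal_eval can turn it back into a list safely.
    return ast.literal_eval(match.group(1))


# Shortened example line; the paths are placeholders, not the ones from the log.
line = (
    "subprocess.CalledProcessError: Command '['gs', '-dBATCH', '-dNOPAUSE', "
    "'-dSAFER', '-dPDFA=2', '-o', '/tmp/out.pdf', '/tmp/in.pdf']' "
    "returned non-zero exit status 1."
)
argv = extract_command(line)
# Print a shell-quoted command line that can be pasted into a terminal.
print(shlex.join(argv))
```

Running the printed command in the same container should show whatever Ghostscript wrote before it gave up, which is usually the useful part for a bug report.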

GooRoo avatar Mar 17 '25 22:03 GooRoo

I'm getting exactly the same error with an invoice file I tried to upload to paperless-ngx.

dsteinborn avatar Mar 31 '25 21:03 dsteinborn

@GooRoo Which OCR mode did you use (skip, redo, force)?

I used your file in my Paperless v2.14.7 instance in skip mode and the log was full of

[2025-04-04 12:54:33,868] [ERROR] [ocrmypdf.optimize] xref 7147: While extracting this image, an error occurred
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/optimize.py", line 334, in extract_images
    result = extract_fn(
             ^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/optimize.py", line 224, in extract_image_generic
    elif not pim.indexed and pim.colorspace in pim.SIMPLE_COLORSPACES:
                             ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/pikepdf/models/image.py", line 211, in colorspace
    raise NotImplementedError(
NotImplementedError: not sure how to get colorspace: ['/Separation', '/Black', '/DeviceRGB', pikepdf.Dictionary({
  "/C0": [ 1, 1, 1 ],
  "/C1": [ Decimal('0.136691'), Decimal('0.121947'), Decimal('0.125305') ],
  "/Domain": [ 0, 1 ],
  "/FunctionType": 2,
  "/N": 1,
  "/Range": [ 0, 1, 0, 1, 0, 1 ]
})]

but the document was finally consumed and usable.
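
For context on what that colorspace array describes: it is a `/Separation` space named `/Black` with `/DeviceRGB` as the alternate space and a type 2 (exponential interpolation) function as the tint transform. Per the PDF specification, a type 2 function maps a tint x in the domain to C0 + x^N (C1 - C0) for each component. The sketch below just works through that arithmetic with the values from the log. The name `tint_to_rgb` is hypothetical, and this is not how pikepdf or OCRmyPDF process the image.

```python
from decimal import Decimal

# Values copied from the dictionary in the log above.
C0 = [Decimal(1), Decimal(1), Decimal(1)]
C1 = [Decimal("0.136691"), Decimal("0.121947"), Decimal("0.125305")]
N = 1


def tint_to_rgb(tint: Decimal) -> list[Decimal]:
    """Evaluate the type 2 function: C0 + tint**N * (C1 - C0) per component."""
    return [c0 + (tint ** N) * (c1 - c0) for c0, c1 in zip(C0, C1)]


# Tint 0 gives C0 (white); tint 1 gives C1 (the near-black RGB triple).
print(tint_to_rgb(Decimal(0)))
print(tint_to_rgb(Decimal(1)))
```

So the image is a single "Black" ink channel with an RGB fallback, which is the case the `colorspace` property reports as not implemented.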

kernie avatar Apr 04 '25 11:04 kernie

@kernie I haven't changed this setting, and I believe its default value is skip.

GooRoo avatar Apr 04 '25 15:04 GooRoo

I am experiencing the same issue, with a document containing personal details: I can share it with the maintainer privately if needed.

EDIT: ah, no, I have a SIGSEGV with my document, not (simply) a non-zero return code.

subprocess.CalledProcessError: Command '['/nix/store/3nspm6rrs988yibwh6szhnfhrysgcydx-ghostscript-10.05.1/bin/gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '/tmp/ocrmypdf.io.94mjv4v2/pdfa.pdf', '-sstdout=%stderr', '/tmp/ocrmypdf.io.94mjv4v2/pdfa.ps', '/tmp/ocrmypdf.io.94mjv4v2/fix_docinfo.pdf']' died with <Signals.SIGSEGV: 11>.
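
The difference between the two failures shows up in the return code that Python's `subprocess` reports: on POSIX a positive value is an ordinary exit status, while a negative value -N means the child was killed by signal N, which is what the "died with <Signals.SIGSEGV: 11>" wording reflects. A small sketch of that distinction, with `describe_returncode` as a made-up helper name:

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Describe a subprocess return code (hypothetical helper)."""
    if returncode < 0:
        # On POSIX, -N means the child process was terminated by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"unknown signal {-returncode}"
        return f"died with signal {name}"
    if returncode == 0:
        return "succeeded"
    # A positive value is the exit status the program chose itself.
    return f"exited with status {returncode}"


print(describe_returncode(1))                # the original report
print(describe_returncode(-signal.SIGSEGV))  # the crash described here
```

An exit status of 1 means Ghostscript stopped on its own, whereas a SIGSEGV means the interpreter itself crashed, so the two reports may well have different causes.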

ambroisie avatar Jul 07 '25 15:07 ambroisie

@ambroisie Please share the document. Brief instructions are here: https://github.com/ocrmypdf/OCRmyPDF/wiki

It's most likely a Ghostscript issue.

jbarlow83 avatar Jul 21 '25 20:07 jbarlow83

@jbarlow83 2025.zip

ambroisie avatar Jul 22 '25 10:07 ambroisie