[BUG] ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([]),)

Open dli7319 opened this issue 2 years ago • 0 comments

Describe the bug

In paperless-ngx, I'm getting an error similar to ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([]),) (paperless-ngx/paperless-ngx/issues/2394). I can reproduce this issue when calling ocrmypdf directly, but I'm not sure whether the bug is in ocrmypdf, in pikepdf, or in LaTeX producing bad PDF files.

To Reproduce

docker run -it --rm -v $(pwd):/data jbarlow83/ocrmypdf /data/3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf /data/output.pdf

Output:

$ docker run -it --rm -v $(pwd):/data jbarlow83/ocrmypdf /data/3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf /data/output.pdf
Scanning contents:  14%|████████████████████▉                                                                                                                                  | 5/36 [00:00<00:00, 106.47page/s]
An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_sync.py", line 378, in run_pipeline
    pdfinfo = get_pdfinfo(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_pipeline.py", line 165, in get_pdfinfo
    return PdfInfo(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 932, in __init__
    self._pages = _pdf_pageinfo_concurrent(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 709, in _pdf_pageinfo_concurrent
    executor(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_concurrent.py", line 87, in __call__
    self._execute(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/builtin_plugins/concurrency.py", line 141, in _execute
    result = future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 666, in _pdf_pageinfo_sync
    page = PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 746, in __init__
    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 792, in _gather_pageinfo
    for info in _process_content_streams(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 594, in _process_content_streams
    yield from _find_form_xobject_images(pdf, container, contentsinfo)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 541, in _find_form_xobject_images
    yield from _process_content_streams(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 586, in _process_content_streams
    contentsinfo = _interpret_contents(container, initial_shorthand)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 236, in _interpret_contents
    ctm = PdfMatrix(operands) @ ctm
  File "/usr/local/lib/python3.10/dist-packages/pikepdf/models/matrix.py", line 56, in __init__
    raise ValueError('invalid arguments: ' + repr(args))
ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([0, Decimal('1.0000001'), Decimal('407.24936'), Decimal('267.78995')]),)
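
For context on the final line of the traceback: the PDF `cm` operator takes six numeric operands (a b c d e f), while the operand list reported here holds only four, which is presumably why PdfMatrix rejects it. The following is a minimal sketch of a defensive operand check, not the actual code of ocrmypdf or pikepdf. The helper name `parse_cm_operands` and the `good` operand list are hypothetical and exist only for illustration.

```python
from decimal import Decimal
from typing import Optional, Sequence, Tuple

Matrix = Tuple[float, float, float, float, float, float]


def parse_cm_operands(operands: Sequence) -> Optional[Matrix]:
    """Return (a, b, c, d, e, f) as floats, or None if the operands are malformed.

    Hypothetical helper: a `cm` operator needs exactly six numeric operands.
    """
    if len(operands) != 6:
        return None
    try:
        return tuple(float(x) for x in operands)
    except (TypeError, ValueError):
        # A non-numeric operand also makes the matrix unusable.
        return None


# Operands copied from the traceback above: four values instead of six.
bad = [0, Decimal("1.0000001"), Decimal("407.24936"), Decimal("267.78995")]

# Hypothetical well-formed operands (a pure translation), for comparison only.
good = [1, 0, 0, 1, Decimal("407.24936"), Decimal("267.78995")]

bad_matrix = parse_cm_operands(bad)    # None: wrong operand count
good_matrix = parse_cm_operands(good)  # six floats
```

A caller that gets None back could skip the malformed operator and continue scanning the page rather than aborting the whole pipeline. Whether that is the right behavior for OCRmyPDF is a design decision for the maintainers.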

Example file

Input files (PDF) that demonstrate the issue:

  1. https://davidl.me/resources/papers/Li_Progressive_Multi_scale_Light_Field_Networks_3DV2022.pdf
  2. https://storage.googleapis.com/pub-tools-public-publication-data/pdf/3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf

Expected behavior

OCRmyPDF should run without crashing.

System

  • OS: Ubuntu 22.04
  • OCRmyPDF Version: ocrmypdf --version 14.0.2.dev18+gf072e911.d20230104
  • Installation method (system package manager, pip, or Docker image): Docker

dli7319 · Jan 10 '23 02:01