OCRmyPDF
[BUG] ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([]),)
Describe the bug
In paperless-ngx, I'm getting an error similar to ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([]),)
(paperless-ngx/paperless-ngx/issues/2394)
I can reproduce this issue when calling ocrmypdf directly, but I'm not sure whether it is a bug in ocrmypdf, a bug in pikepdf, or a case of LaTeX producing bad PDF files.
To Reproduce
docker run -it --rm -v $(pwd):/data jbarlow83/ocrmypdf /data/3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf /data/output.pdf
Output:
$ docker run -it --rm -v $(pwd):/data jbarlow83/ocrmypdf /data/3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf /data/output.pdf
Scanning contents: 14%|████████████████████▉ | 5/36 [00:00<00:00, 106.47page/s]
An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_sync.py", line 378, in run_pipeline
    pdfinfo = get_pdfinfo(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_pipeline.py", line 165, in get_pdfinfo
    return PdfInfo(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 932, in __init__
    self._pages = _pdf_pageinfo_concurrent(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 709, in _pdf_pageinfo_concurrent
    executor(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/_concurrent.py", line 87, in __call__
    self._execute(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/builtin_plugins/concurrency.py", line 141, in _execute
    result = future.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 666, in _pdf_pageinfo_sync
    page = PageInfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 746, in __init__
    self._gather_pageinfo(pdf, pageno, infile, check_pages, detailed_analysis)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 792, in _gather_pageinfo
    for info in _process_content_streams(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 594, in _process_content_streams
    yield from _find_form_xobject_images(pdf, container, contentsinfo)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 541, in _find_form_xobject_images
    yield from _process_content_streams(
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 586, in _process_content_streams
    contentsinfo = _interpret_contents(container, initial_shorthand)
  File "/usr/local/lib/python3.10/dist-packages/ocrmypdf/pdfinfo/info.py", line 236, in _interpret_contents
    ctm = PdfMatrix(operands) @ ctm
  File "/usr/local/lib/python3.10/dist-packages/pikepdf/models/matrix.py", line 56, in __init__
    raise ValueError('invalid arguments: ' + repr(args))
ValueError: invalid arguments: (pikepdf._qpdf._ObjectList([0, Decimal('1.0000001'), Decimal('407.24936'), Decimal('267.78995')]),)
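The operand list in the final ValueError has four entries, which suggests a malformed matrix operator inside a form XObject (presumably cm, which takes six numbers). The following is only a rough sketch of how a content-stream interpreter could skip such an operator instead of raising. It does not use pikepdf, and the names matrix_from_operands, multiply, and apply_cm are hypothetical, not OCRmyPDF or pikepdf APIs. Whether silently skipping is the right fix is an assumption on my part.

```python
from decimal import Decimal
from typing import Optional, Sequence, Tuple

# PDF affine matrix in shorthand form (a, b, c, d, e, f), standing for
# [[a, b, 0], [c, d, 0], [e, f, 1]].
Matrix = Tuple[float, float, float, float, float, float]

IDENTITY: Matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)


def matrix_from_operands(operands: Sequence) -> Optional[Matrix]:
    """Return (a, b, c, d, e, f) for well-formed operands, else None.

    Hypothetical helper: a matrix operator needs exactly six numbers.
    """
    if len(operands) != 6:
        return None
    try:
        return tuple(float(x) for x in operands)
    except (TypeError, ValueError):
        return None


def multiply(m: Matrix, n: Matrix) -> Matrix:
    """Compute the matrix product m x n in shorthand form."""
    a1, b1, c1, d1, e1, f1 = m
    a2, b2, c2, d2, e2, f2 = n
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply_cm(ctm: Matrix, operands: Sequence) -> Matrix:
    """Premultiply ctm by the operand matrix; leave ctm unchanged if malformed."""
    new = matrix_from_operands(operands)
    if new is None:
        return ctm  # assumption: ignore the bad operator rather than crash
    return multiply(new, ctm)


# Operands copied from the ValueError above: only four numbers.
bad_operands = [0, Decimal("1.0000001"), Decimal("407.24936"), Decimal("267.78995")]
ctm_after_bad = apply_cm(IDENTITY, bad_operands)

# A well-formed translation by (10, 20) applied on top of a 2x scale.
ctm_after_good = apply_cm((2.0, 0.0, 0.0, 2.0, 0.0, 0.0), [1, 0, 0, 1, 10, 20])
```

With this sketch the four-operand list leaves the current matrix untouched, while a six-operand list is combined with it as usual.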
Example file
- https://davidl.me/resources/papers/Li_Progressive_Multi_scale_Light_Field_Networks_3DV2022.pdf
- https://storage.googleapis.com/pub-tools-public-publication-data/pdf/3fca60c45241f0ed03e5e6eea0a49b932c0b0c10.pdf
Expected behavior
OCRmyPDF should run without crashing.
System
- OS: Ubuntu 22.04
- OCRmyPDF Version (ocrmypdf --version): 14.0.2.dev18+gf072e911.d20230104
- How did you install ocrmypdf? Did you use a system package manager, pip, or a Docker image? Docker