pdfminer.six pdfminer.psparser.PSSyntaxError: Invalid dictionary construct

In a call to get_pages, this PDF raised an exception.

pdfminer version: refs/tags/20201018
PDF: https://source.android.com/compatibility/5.1/android-5.1-cdd.pdf

My code looks like this:

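import io

from pdfminer import converter, pdfinterp, pdfpage

raw_input = io.BytesIO(content)  # content holds the PDF file's bytes
html_output = io.BytesIO()
resources = pdfinterp.PDFResourceManager()
device = converter.HTMLConverter(resources, html_output)
interpreter = pdfinterp.PDFPageInterpreter(resources, device)

# get_pages parses the document structure first; that is where the exception is raised
for page in pdfpage.PDFPage.get_pages(raw_input):
    interpreter.process_page(page)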

Stack trace:

  File ""...:
    for page in pdfpage.PDFPage.get_pages(raw_input):
  File "pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "pdfminer/pdfdocument.py", line 578, in __init__
    xref.load(parser)
  File "pdfminer/pdfdocument.py", line 190, in load
    (_, obj) = parser.nextobject()
  File "pdfminer/psparser.py", line 590, in nextobject
    raise PSSyntaxError(error_msg)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Dest', /'', /b'\xd9', /b'\x0e', /b'\x86', /b'p\x82\x17\xe1\x8a\xed\x91\x07HS', /b'#', 7, /b'B\x11,\xe4\xa8\xf8', /'Border', [0, 0, 0], /'Type', /'Annot', /'Rect', [69.36, 659.6, 201.84, 673.04], /'Subtype', /'Link']

Nov 26 '20 09:11 markmcd

I can reproduce this issue:

$ python tools/pdf2txt.py ~/Downloads/android-5.1-cdd.pdf 
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/high_level.py", line 83, in extract_text_to_fp
    caching=not disable_caching):
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfdocument.py", line 578, in __init__
    xref.load(parser)
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfdocument.py", line 190, in load
    (_, obj) = parser.nextobject()
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/psparser.py", line 590, in nextobject
    raise PSSyntaxError(error_msg)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Dest', /'', /b'\xd9', /b'\x0e', /b'\x86', /b'p\x82\x17\xe1\x8a\xed\x91\x07HS', /b'#', 7, /b'B\x11,\xe4\xa8\xf8', /'Border', [0, 0, 0], /'Type', /'Annot', /'Rect', [69.36, 659.6, 201.84, 673.04], /'Subtype', /'Link']

Dec 11 '20 17:12 pietermarsman

@pietermarsman Is it failing because of OCR? Could it be fixed temporarily by catching the exception and bypassing OCR? I just want the documents inside Paperless. Sorry, this comment was intended for the Paperless project. I don't see a delete option.

May 09 '22 14:05 stepanov1975

It is failing because the syntax of the PDF is invalid. I expect that we won't fix this issue, but I haven't looked into it yet, so I'm not sure about that.

This is unrelated to OCR.

May 24 '22 17:05 pietermarsman
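
For anyone who only needs a batch job to keep running past such a file, here is a minimal sketch of catching the exception at the call site. It is a workaround on the caller's side, not a fix in pdfminer.six. The helper names `call_or_none` and `extract_text_or_none` are hypothetical; `extract_text` and `PSSyntaxError` are the pdfminer.six names that appear in the tracebacks above, and the sketch assumes the error still surfaces as `PSSyntaxError` in your version.

```python
import logging

logger = logging.getLogger(__name__)


def call_or_none(func, arg, errors):
    """Call func(arg); return None if it raises one of the given errors.

    Hypothetical helper, not part of pdfminer.six.
    """
    try:
        return func(arg)
    except errors as exc:
        logger.warning("skipping %r: %s", arg, exc)
        return None


def extract_text_or_none(path):
    """Extract text from a PDF, or return None if pdfminer rejects its syntax."""
    # Imported lazily so call_or_none stays usable without pdfminer installed.
    from pdfminer.high_level import extract_text
    from pdfminer.psparser import PSSyntaxError

    return call_or_none(extract_text, path, errors=(PSSyntaxError,))
```

A caller would then treat a `None` result as "could not be parsed" and move on to the next document. Only `PSSyntaxError` is swallowed here, so unrelated failures such as a missing file still raise.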

https://arxiv.org/pdf/2207.05378.pdf same issue for this pdf

Sep 02 '22 06:09 johnson7788

> It is failing because the syntax of the PDF is invalid. I expect that we won't fix this issue, but I haven't looked into it yet, so I'm not sure about that.
>
> This is unrelated to OCR.

The PDF can be opened by a PDF reader. Which part is invalid? Could this be ignored?

Apr 09 '24 09:04 xsank
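
On "which part is invalid": the error message itself gives a hint. A PDF dictionary is a flat sequence of key/value pairs, so the tokens between `<<` and `>>` must come in an even count. The list in the message has 17 entries. The sketch below replays that list as plain Python values; it is an illustration of the pairing rule, not pdfminer's actual code, and `pair_dictionary_tokens` is a hypothetical name. I have not inspected the PDF itself, so the cause of the stray tokens is a guess.

```python
def pair_dictionary_tokens(tokens):
    """Pair a flat [key, value, key, value, ...] token list into a dict."""
    if len(tokens) % 2 != 0:
        raise ValueError(
            f"odd number of tokens ({len(tokens)}): cannot pair keys with values"
        )
    return dict(zip(tokens[0::2], tokens[1::2]))


# Tokens copied from the error message, with PDF names written as str/bytes.
tokens = [
    "Dest", "", b"\xd9", b"\x0e", b"\x86",
    b"p\x82\x17\xe1\x8a\xed\x91\x07HS", b"#", 7,
    b"B\x11,\xe4\xa8\xf8",
    "Border", [0, 0, 0],
    "Type", "Annot",
    "Rect", [69.36, 659.6, 201.84, 673.04],
    "Subtype", "Link",
]

# The last eight tokens pair up cleanly.
well_formed_tail = pair_dictionary_tokens(tokens[9:])

# The full list does not, because of the nine tokens starting at "Dest".
try:
    pair_dictionary_tokens(tokens)
    full_list_ok = True
except ValueError:
    full_list_ok = False
```

Read from the end, `Subtype`, `Rect`, `Type` and `Border` each have a value. The nine tokens starting at `Dest` do not form pairs, which suggests the value of `Dest` was tokenized as a run of separate names containing binary bytes. PDF readers are often more lenient about this kind of damage, which may be why the file still opens in one.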
