pdfminer.psparser.PSSyntaxError: Invalid dictionary construct

Open markmcd opened this issue 4 years ago • 5 comments

In a call to get_pages, this PDF raised an exception.

pdfminer version: refs/tags/20201018

PDF: https://source.android.com/compatibility/5.1/android-5.1-cdd.pdf

My code looks like this:

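import io

from pdfminer import converter, pdfinterp, pdfpage

raw_input = io.BytesIO(content)  # content holds the PDF file contents as bytes
html_output = io.BytesIO()
resources = pdfinterp.PDFResourceManager()
device = converter.HTMLConverter(resources, html_output)
interpreter = pdfinterp.PDFPageInterpreter(resources, device)

# The exception is raised on the first iteration, while get_pages builds the PDFDocument
for page in pdfpage.PDFPage.get_pages(raw_input):
    interpreter.process_page(page)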

Stack trace:

  File ""...:
    for page in pdfpage.PDFPage.get_pages(raw_input):
  File "pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "pdfminer/pdfdocument.py", line 578, in __init__
    xref.load(parser)
  File "pdfminer/pdfdocument.py", line 190, in load
    (_, obj) = parser.nextobject()
  File "pdfminer/psparser.py", line 590, in nextobject
    raise PSSyntaxError(error_msg)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Dest', /'', /b'\xd9', /b'\x0e', /b'\x86', /b'p\x82\x17\xe1\x8a\xed\x91\x07HS', /b'#', 7, /b'B\x11,\xe4\xa8\xf8', /'Border', [0, 0, 0], /'Type', /'Annot', /'Rect', [69.36, 659.6, 201.84, 673.04], /'Subtype', /'Link'] 

markmcd avatar Nov 26 '20 09:11 markmcd

I can reproduce this issue:

$ python tools/pdf2txt.py ~/Downloads/android-5.1-cdd.pdf 
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/high_level.py", line 83, in extract_text_to_fp
    caching=not disable_caching):
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfdocument.py", line 578, in __init__
    xref.load(parser)
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfdocument.py", line 190, in load
    (_, obj) = parser.nextobject()
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/psparser.py", line 590, in nextobject
    raise PSSyntaxError(error_msg)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Dest', /'', /b'\xd9', /b'\x0e', /b'\x86', /b'p\x82\x17\xe1\x8a\xed\x91\x07HS', /b'#', 7, /b'B\x11,\xe4\xa8\xf8', /'Border', [0, 0, 0], /'Type', /'Annot', /'Rect', [69.36, 659.6, 201.84, 673.04], /'Subtype', /'Link']

pietermarsman avatar Dec 11 '20 17:12 pietermarsman

@pietermarsman Is it failing because of OCR? Maybe a temporary fix would be to catch the exception and bypass OCR? I just want the documents inside Paperless. Sorry, this was intended to be posted in the Paperless project. I don't see a delete option.

stepanov1975 avatar May 09 '22 14:05 stepanov1975

It is failing because the syntax of the PDF is invalid. I expect that we won't fix this issue, but I haven't looked into it yet, so I'm not sure about that.

This is unrelated to OCR.

pietermarsman avatar May 24 '22 17:05 pietermarsman
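
For callers who only want to skip a malformed file rather than crash, as the earlier comment suggests, a minimal sketch follows. It assumes `pdfminer.high_level.extract_text` is an acceptable stand-in for the HTMLConverter loop in the original report, and that `PSSyntaxError` from `pdfminer.psparser` is the only error worth skipping. `try_extract` and `pdf_to_text_or_none` are hypothetical helper names, not part of pdfminer. This does not repair the PDF; it only keeps one bad file from stopping a batch.

```python
import io
from typing import Callable, Optional, Tuple, Type


def try_extract(
    extract: Callable[[], str],
    skip_on: Tuple[Type[BaseException], ...],
) -> Optional[str]:
    """Run extract(); return None if it raises one of the skip_on errors."""
    try:
        return extract()
    except skip_on as exc:
        # Truncate the message: the parser dumps the whole object list into it.
        print(f"Skipping malformed PDF: {str(exc)[:80]}")
        return None


def pdf_to_text_or_none(content: bytes) -> Optional[str]:
    """Extract text from PDF bytes, or return None on a PSSyntaxError."""
    # Imported lazily so try_extract stays usable without pdfminer installed.
    from pdfminer.high_level import extract_text
    from pdfminer.psparser import PSSyntaxError

    return try_extract(lambda: extract_text(io.BytesIO(content)), (PSSyntaxError,))
```

Any exception type not listed in `skip_on` still propagates, so unrelated failures (for example a missing file) are not silently swallowed.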

Same issue for this PDF: https://arxiv.org/pdf/2207.05378.pdf

johnson7788 avatar Sep 02 '22 06:09 johnson7788

> It is failing because the syntax of the PDF is invalid. I expect that we won't fix this issue, but I haven't looked into it yet, so I'm not sure about that.
>
> This is unrelated to OCR.

The PDF can be opened by a PDF reader. Which part is invalid? Could this be ignored?

xsank avatar Apr 09 '24 09:04 xsank
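
On the question of which part is invalid: the object list printed in the error message has 17 entries, and a dictionary is made of key/value pairs, so an odd count cannot be paired up. The sketch below illustrates that check in plain Python. It assumes the parser's rule is "an even number of objects between `<<` and `>>`", which matches the message but is a reading of the traceback rather than a confirmed description of pdfminer internals. `pair_dict_objects` is a hypothetical name, and PDF names are modelled here as `str` or `bytes`.

```python
from typing import Any, Dict, List


def pair_dict_objects(objs: List[Any]) -> Dict[Any, Any]:
    """Pair a flat [key, value, key, value, ...] list into a dict."""
    if len(objs) % 2 != 0:
        raise ValueError(
            f"Invalid dictionary construct: {len(objs)} objects cannot be paired"
        )
    return {objs[i]: objs[i + 1] for i in range(0, len(objs), 2)}


# The objects from the error message in this thread, transcribed by hand.
objs_from_error = [
    "Dest", "", b"\xd9", b"\x0e", b"\x86",
    b"p\x82\x17\xe1\x8a\xed\x91\x07HS", b"#", 7, b"B\x11,\xe4\xa8\xf8",
    "Border", [0, 0, 0],
    "Type", "Annot",
    "Rect", [69.36, 659.6, 201.84, 673.04],
    "Subtype", "Link",
]

try:
    pair_dict_objects(objs_from_error)
except ValueError as exc:
    print(exc)

# The well-formed tail of the same list pairs up without complaint.
print(pair_dict_objects(["Type", "Annot", "Subtype", "Link"]))
```

The entries right after `/Dest` look like raw bytes that were tokenized as names, which hints that the destination value is what got mangled, although nothing in this thread confirms that.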