pdfminer.six
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct
In a call to get_pages, this PDF raised an exception.
pdfminer version: refs/tags/20201018
PDF: https://source.android.com/compatibility/5.1/android-5.1-cdd.pdf
My code looks like this:
import io

from pdfminer import converter, pdfinterp, pdfpage

raw_input = io.BytesIO(content)  # The file contents
html_output = io.BytesIO()
resources = pdfinterp.PDFResourceManager()
device = converter.HTMLConverter(resources, html_output)
interpreter = pdfinterp.PDFPageInterpreter(resources, device)
for page in pdfpage.PDFPage.get_pages(raw_input):
    interpreter.process_page(page)
Stack trace:
  File ""...:
    for page in pdfpage.PDFPage.get_pages(raw_input):
  File "pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "pdfminer/pdfdocument.py", line 578, in __init__
    xref.load(parser)
  File "pdfminer/pdfdocument.py", line 190, in load
    (_, obj) = parser.nextobject()
  File "pdfminer/psparser.py", line 590, in nextobject
    raise PSSyntaxError(error_msg)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Dest', /'', /b'\xd9', /b'\x0e', /b'\x86', /b'p\x82\x17\xe1\x8a\xed\x91\x07HS', /b'#', 7, /b'B\x11,\xe4\xa8\xf8', /'Border', [0, 0, 0], /'Type', /'Annot', /'Rect', [69.36, 659.6, 201.84, 673.04], /'Subtype', /'Link']
I can reproduce this issue:
$ python tools/pdf2txt.py ~/Downloads/android-5.1-cdd.pdf
Traceback (most recent call last):
  File "tools/pdf2txt.py", line 204, in <module>
    sys.exit(main())
  File "tools/pdf2txt.py", line 198, in main
    outfp = extract_text(**vars(A))
  File "tools/pdf2txt.py", line 66, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/high_level.py", line 83, in extract_text_to_fp
    caching=not disable_caching):
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfdocument.py", line 578, in __init__
    xref.load(parser)
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/pdfdocument.py", line 190, in load
    (_, obj) = parser.nextobject()
  File "/home/pieter/projects/pdfminer/pdfminer-upstream/pdfminer/psparser.py", line 590, in nextobject
    raise PSSyntaxError(error_msg)
pdfminer.psparser.PSSyntaxError: Invalid dictionary construct: [/'Dest', /'', /b'\xd9', /b'\x0e', /b'\x86', /b'p\x82\x17\xe1\x8a\xed\x91\x07HS', /b'#', 7, /b'B\x11,\xe4\xa8\xf8', /'Border', [0, 0, 0], /'Type', /'Annot', /'Rect', [69.36, 659.6, 201.84, 673.04], /'Subtype', /'Link']
@pietermarsman Is it failing because of OCR? Could a temporary fix be to catch the exception and bypass OCR? I just want the documents inside Paperless. Sorry, this was intended for the Paperless tracker. I don't see a delete option.
It is failing because the syntax of the PDF is invalid. I expect that we won't fix this issue, but I haven't looked into it yet, so I'm not sure about that.
This is unrelated to OCR.
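For the workaround asked about above (catching the exception so the rest of a pipeline keeps going), a minimal sketch could look like this. `convert_or_skip` is a hypothetical helper, not part of pdfminer.six; it only wraps whatever conversion function you already have and returns None for the error types you choose to tolerate.

```python
from typing import Callable, Optional, Tuple, Type


def convert_or_skip(
    convert: Callable[[bytes], str],
    content: bytes,
    skip_on: Tuple[Type[Exception], ...],
) -> Optional[str]:
    """Run convert(content); return None if it raises one of skip_on.

    Hypothetical helper for illustration: any other exception still propagates.
    """
    try:
        return convert(content)
    except skip_on as exc:
        # The document is skipped, not repaired; log it so it can be reviewed later.
        print(f"skipping unparseable PDF: {exc}")
        return None
```

With pdfminer.six, `skip_on` would be `(pdfminer.psparser.PSSyntaxError,)`, the exception shown in the traces above, and `convert` would be your own function around `PDFPage.get_pages`. Note that this only hides the failure: no text is extracted from the affected file.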
Same issue for this PDF: https://arxiv.org/pdf/2207.05378.pdf
The PDF can be opened by a PDF reader. Which part is invalid? Could this be ignored?
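On which part is invalid: judging from the error message alone, the parser collected 17 objects for that dictionary, and a dictionary needs an even number so that every key has a value. As far as I can tell that odd count is what triggers "Invalid dictionary construct"; the binary-looking names after /Dest are probably where the extra object comes from, but that is a guess. The sketch below is not pdfminer's code. `pair_up` is a hypothetical helper that only illustrates the even-count rule.

```python
from typing import Any, Dict, List


def pair_up(objs: List[Any]) -> Dict[Any, Any]:
    """Turn a flat [key, value, key, value, ...] list into a dict.

    Hypothetical illustration of why an odd-length list cannot be a dictionary.
    """
    if len(objs) % 2 != 0:
        raise ValueError(
            f"odd number of objects ({len(objs)}), cannot form key/value pairs"
        )
    # Even positions are keys, odd positions are the matching values.
    return dict(zip(objs[0::2], objs[1::2]))


# A well-formed fragment pairs up cleanly.
print(pair_up(["Type", "Annot", "Subtype", "Link"]))

# A fragment with one stray object (simplified stand-in for the list in the error).
try:
    pair_up(["Dest", "", "Type", "Annot", "Subtype"])
except ValueError as exc:
    print(exc)
```

A PDF reader may open the file anyway because many readers tolerate or repair a malformed object like this, while a strict parser stops at the first one.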