pdfminer.six

KeyError: 'ID' when running pdf2txt.py

Open • alexkillen opened this issue 4 years ago • 1 comment

Unfortunately I cannot include the PDF as it is a bank statement, but hopefully the details below are enough.

The error is as follows:

DEBUG:pdfminer.pdfdocument:trailer={'Size': 70, 'Root': <PDFObjRef:69>, 'Info': <PDFObjRef:3>, 'Encrypt': <PDFObjRef:2>}
INFO:pdfminer.pdfdocument:trailer: {'Size': 70, 'Root': <PDFObjRef:69>, 'Info': <PDFObjRef:3>, 'Encrypt': <PDFObjRef:2>}
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/tools/pdf2txt.py", line 195, in <module>
    sys.exit(main())
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/tools/pdf2txt.py", line 189, in main
    outfp = extract_text(**vars(A))
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/tools/pdf2txt.py", line 57, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/pdfminer/high_level.py", line 79, in extract_text_to_fp
    for page in PDFPage.get_pages(inf,
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/pdfminer/pdfpage.py", line 128, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/home/alexk/Development/playground/pdfminer_six/pdfminer.six/pdfminer/pdfdocument.py", line 589, in __init__
    self.encryption = (list_value(trailer['ID']),
KeyError: 'ID'

The error occurs when attempting to access the 'ID' property of the File Trailer, but as can be seen in the DEBUG line in the above output, 'ID' is not in the trailer. Note that 'ID' is listed as optional in the PDF spec: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf#page=88.
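The failure can be reproduced without the PDF. This is only a minimal sketch: the trailer is modelled as a plain dict with the keys from the DEBUG line, and the string values are illustrative stand-ins for the real PDFObjRef objects.

```python
# Stand-in for the parsed trailer shown in the DEBUG output above.
# The strings replace the real PDFObjRef objects (illustrative assumption).
trailer = {
    "Size": 70,
    "Root": "<PDFObjRef:69>",
    "Info": "<PDFObjRef:3>",
    "Encrypt": "<PDFObjRef:2>",
}

has_encrypt = "Encrypt" in trailer  # True: the encryption branch is entered
has_id = "ID" in trailer            # False: the optional entry is absent

try:
    trailer["ID"]                   # same unguarded lookup as in the traceback
    lookup_error = None
except KeyError as exc:
    lookup_error = exc              # KeyError: 'ID'
```

Any encrypted PDF whose trailer omits the optional 'ID' entry should hit the same unguarded lookup.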

I managed to work around the issue by making the following change in pdfminer/pdfdocument.py (lines 588-590):

        if 'Encrypt' in trailer:
            # self.encryption = (list_value(trailer['ID']),
            self.encryption = (list_value(trailer['ID']) if 'ID' in trailer else [''.encode('utf-8'), ''.encode('utf-8')],
                               dict_value(trailer['Encrypt']))

This simply provides empty utf-8 encoded strings as the ID. I'm not sure if this would be the right "fix" but it appeared to work in my case.

alexkillen • Aug 14 '20 10:08

Since it is "(Optional, but strongly recommended; PDF 1.1)", we should indeed make this more robust by assuming the value can be missing.

It looks like there is no sensible default, so using a pair of empty byte strings is OK.

I suggest using trailer.get('ID', [b'', b'']).
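A minimal sketch of that suggestion, using a plain dict in place of the parsed trailer. The helper name `encryption_from_trailer` is hypothetical, and the `list_value`/`dict_value` resolution steps from pdfdocument.py are left out, so this only illustrates the defaulting behaviour and is not the actual patch.

```python
def encryption_from_trailer(trailer):
    """Return (doc_id, encrypt_dict), or None if the trailer is not encrypted.

    Hypothetical helper for illustration only; the real code lives in
    PDFDocument.__init__ and also resolves indirect object references.
    """
    if "Encrypt" not in trailer:
        return None
    # 'ID' is optional in the trailer, so fall back to two empty byte strings.
    doc_id = trailer.get("ID", [b"", b""])
    return (doc_id, trailer["Encrypt"])


# Trailer without 'ID' (the case from this issue) and one with it.
without_id = encryption_from_trailer({"Size": 70, "Encrypt": {"V": 2}})
with_id = encryption_from_trailer({"ID": [b"abc", b"def"], "Encrypt": {"V": 2}})
```

With `dict.get`, a trailer that does carry an 'ID' is passed through unchanged, and only the missing case receives the default.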

pietermarsman • Sep 13 '20 10:09