polyfile icon indicating copy to clipboard operation
polyfile copied to clipboard

Two crashes when parsing PDFs

Open TACIXAT opened this issue 5 years ago • 2 comments

Govdocs -

000899.pdf 001940.pdf

Parsing PDF obj 62 0Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile===0.1.6-git', 'console_scripts', 'polyfile')()
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/__main__.py", line 99, in main
    for match in matcher.match(file_path, progress_callback=progress_callback, trid_defs=trid_defs):
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/polyfile.py", line 178, in match
    yield from submatch_iter
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 296, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 290, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 118, in parse_object
    yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 38, in _emit_dict
    value_end = value[-1].offset.offset + len(value[-1].token)
IndexError: list index out of range
Parsing PDF obj 424 0Traceback (most recent call last):
  File "/usr/local/bin/polyfile", line 11, in <module>
    load_entry_point('polyfile===0.1.6-git', 'console_scripts', 'polyfile')()
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/__main__.py", line 99, in main
    for match in matcher.match(file_path, progress_callback=progress_callback, trid_defs=trid_defs):
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/polyfile.py", line 178, in match
    yield from submatch_iter
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 296, in submatch
    yield from parse_pdf(file_stream, matcher=self.matcher, parent=self)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 290, in parse_pdf
    yield from parse_object(file_stream, object, matcher=matcher, parent=parent)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 118, in parse_object
    yield from _emit_dict(oPDFParseDictionary.parsed, obj, parent.offset)
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 61, in _emit_dict
    ''.join(v.token for v in value),
  File "/home/taxicat/.local/lib/python3.6/site-packages/polyfile-0.1.6_git-py3.6.egg/polyfile/pdf.py", line 61, in <genexpr>
    ''.join(v.token for v in value),
AttributeError: 'str' object has no attribute 'token'

TACIXAT avatar Jul 07 '20 02:07 TACIXAT

I saw the latter bug also, on the first file I tried it on (a big one that I think was generated from a docx file), under Python 3.8: AttributeError: 'str' object has no attribute 'token'

$ polyfile --version PolyFile version 0.1.7

nealmcb avatar Feb 02 '21 21:02 nealmcb

This was likely fixed by switching PDF parser implementations. We will confirm and close if that is the case.

ESultanik avatar Apr 15 '22 19:04 ESultanik