pdfminer.six

Crash on non-ASCII input.

Open vk2diy opened this issue 1 year ago • 2 comments

Description

Crash on non-ASCII input: UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)

Steps to reproduce the bug

To make it easier, this will download mc3362.pdf.

  1. wget https://github.com/user-attachments/files/16489263/mc3362.pdf && pdf2txt.py mc3362.pdf

Error produced

Traceback (most recent call last):
  File "pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
                                        ^^^^^^^^^^^^^^
  File "pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 841, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 854, in render_contents
    self.execute(list_value(streams))
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 869, in execute
    name = keyword_name(obj).decode('ascii')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)

vk2diy avatar Aug 04 '24 23:08 vk2diy
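For readers unfamiliar with this error class, here is a minimal sketch of what the last traceback frame runs into. It only reproduces the decoding failure on the byte named in the message. It does not use pdfminer at all, and the `errors="replace"` fallback is shown purely for illustration, not as the way the library handles such keywords.

```python
# The byte reported in the traceback; 0x85 is outside the 7-bit ASCII range.
raw = b"\x85"

message = None
try:
    raw.decode("ascii")  # strict decoding, same codec as in the traceback
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Illustration only: a tolerant decode substitutes U+FFFD instead of raising.
fallback = raw.decode("ascii", errors="replace")
print(repr(fallback))
```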

What version of pdfminer.six are you using? I can't reproduce this with either Python 3.11 or 3.12 and pdfminer.six v20240706.

dhdaines avatar Aug 06 '24 01:08 dhdaines

Looks old.

./lib/python3.12/site-packages/pdfminer-20191125.dist-info

Unsure why it would be old, I used pip to install it. I'm not really a python person.

vk2diy avatar Aug 06 '24 02:08 vk2diy

Closing since @dhdaines can't reproduce. You can probably fix this by removing all versions of pdfminer and pdfminer.six and then installing the latest version from pip.

pietermarsman avatar Nov 26 '24 18:11 pietermarsman
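Before following the advice above, it may help to see which distributions are actually installed. The sketch below uses only the standard library. The dist-info directory quoted earlier is named pdfminer-20191125, which suggests (an assumption, not confirmed in this thread) that a distribution called `pdfminer` was installed rather than `pdfminer.six`. The helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names=("pdfminer", "pdfminer.six")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

If both names show a version, uninstalling both with pip and then installing only pdfminer.six matches the suggestion above.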

I can confirm the error is gone with a new download in a new python venv. Probably it's a historic bug.

For reference, here is the output of pip freeze:

cffi==1.17.1
charset-normalizer==3.4.0
cryptography==44.0.0
pdfminer.six==20240706
pillow==11.0.0
pycparser==2.22

However, I now get no output at all, which is not what I expected. Could you tell me whether you get any text output from the file above?

I also tried various command-line options, such as pdf2txt.py -A -n -t text mc3362.pdf, with the same result.

vk2diy avatar Nov 28 '24 16:11 vk2diy
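One way to check whether the empty output comes from the command-line script or from the document itself is to call the library directly. This is a rough sketch, assuming pdfminer.six is installed and mc3362.pdf is in the current directory. `extract_text` is the pdfminer.six high-level helper, while `has_text` and `check_pdf` are names invented for this example.

```python
def has_text(extracted):
    """True if the extracted string contains anything besides whitespace."""
    return bool(extracted.strip())


def check_pdf(path):
    """Extract text with pdfminer.six and report whether any was found."""
    # Imported here so the helper above works even without pdfminer.six.
    from pdfminer.high_level import extract_text

    return has_text(extract_text(path))


if __name__ == "__main__":
    # Assumes the file downloaded in the reproduction step is present.
    print("text found" if check_pdf("mc3362.pdf") else "no text extracted")
```

If this also reports no text, the problem is more likely in the document or in the extraction itself than in the pdf2txt.py options.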