Crash on non-ASCII input.
Description
Crash on non-ASCII input: UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
Steps to reproduce the bug
To make this easier, the command below downloads mc3362.pdf and then runs pdf2txt.py on it.
wget https://github.com/user-attachments/files/16489263/mc3362.pdf && pdf2txt.py mc3362.pdf
Error produced
Traceback (most recent call last):
  File "pdf2txt.py", line 115, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
                                        ^^^^^^^^^^^^^^
  File "pdf2txt.py", line 110, in main
    interpreter.process_page(page)
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 841, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 854, in render_contents
    self.execute(list_value(streams))
  File "/lib/python3.12/site-packages/pdfminer/pdfinterp.py", line 869, in execute
    name = keyword_name(obj).decode('ascii')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)
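For what it's worth, the last line of the traceback can be reproduced in isolation. This is only a minimal sketch: the byte string below is a stand-in (the actual keyword bytes from the PDF are not shown in the traceback), and try_decode_ascii is a hypothetical helper, not part of pdfminer.

```python
# Byte 0x85 is outside the 7-bit ASCII range (0x00-0x7F), so decoding it
# with the 'ascii' codec raises UnicodeDecodeError.
raw = b"\x85"  # assumption: stand-in for the keyword bytes in the PDF stream


def try_decode_ascii(data: bytes) -> str:
    """Decode bytes as ASCII, or return the error text if that fails."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return f"{type(exc).__name__}: {exc}"


print(try_decode_ascii(raw))
```

Pure-ASCII keyword bytes decode without error, so only input containing bytes above 0x7F would hit this code path.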
What version of pdfminer.six are you using? I can't reproduce this with either Python 3.11 or 3.12 and pdfminer.six v20240706.
Looks old.
./lib/python3.12/site-packages/pdfminer-20191125.dist-info
I'm unsure why it would be old; I used pip to install it. I'm not really a Python person.
Closing since @dhdaines can't reproduce. You can probably fix this by removing all versions of pdfminer and pdfminer.six and then installing the latest version from pip.
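To see which of the two distributions are present before removing anything, a small check along these lines may help. It is a sketch using the standard library's importlib.metadata; installed_versions is a hypothetical helper name, and it only reports what the current Python environment sees.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Both names are checked because the old "pdfminer" distribution and
    # "pdfminer.six" install the same "pdfminer" import package.
    for name, ver in installed_versions(["pdfminer", "pdfminer.six"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

If both show a version, or only the plain pdfminer one does, that would be consistent with the stale pdfminer-20191125.dist-info directory reported above.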
I can confirm the error is gone with a fresh install in a new Python venv. It was probably an old bug that has since been fixed.
For reference, here is the output of pip freeze:
cffi==1.17.1
charset-normalizer==3.4.0
cryptography==44.0.0
pdfminer.six==20240706
pillow==11.0.0
pycparser==2.22
However, I now get no output at all rather than the extracted text. Could you tell me whether you can get any text output from the file specified?
I also tried various command-line options, such as pdf2txt.py -A -n -t text mc3362.pdf, with the same result.