pdfminer
Problem converting pdf to txt with pdf2txt.py
I am trying to convert the following PDF to text: http://www.kabupro.jp/edp/20140529/S1001UPO.pdf
I am using the following command: pdf2txt.py -o text.txt S1001UPO.pdf
The document is encrypted, so I removed the encryption first; however, even after doing this I get the error below.
I suspect the issue is "TypeError: must be encoded string without NULL bytes, not str", for which this question seems to offer a solution: http://stackoverflow.com/questions/18265084/typeerror-must-be-string-without-null-bytes-not-str
Could someone point me to a workaround? Thank you!
Traceback (most recent call last):
File "/Users/JB1/anaconda/bin/pdf2txt.py", line 115, in
@JSB97 I have also encountered the same error. The problematic snippet in cmapdb.py seems to be -
def _load_data(klass, name):
    filename = '%s.pickle.gz' % name
    if klass.debug:
        print >>sys.stderr, 'loading:', name
    cmap_paths = (os.environ.get('CMAP_PATH', '/usr/share/pdfminer/'),
                  os.path.join(os.path.dirname(__file__), 'cmap'),)
    for directory in cmap_paths:
        path = os.path.join(directory, filename)
Printing the variable "filename" gives me -
to-unicode-PDFXC30-Identity.pickle.gz
Printing "repr(filename)" yields -
'to-unicode-PDFXC30-Identity\x00\x00.pickle.gz'
Apparently, these \x00 characters are causing the issue. One fix that solved this issue for me was -
filename = filename.replace('\0', '')
I am not sure what is causing this issue, though.
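For anyone who would rather not edit cmapdb.py in place, the same idea can be sketched as a small standalone helper. This is only a minimal sketch: cmap_filename and candidate_paths are hypothetical names that are not part of pdfminer, and it assumes the NUL bytes in the CMap name carry no meaning and can simply be dropped.

```python
import os


def cmap_filename(name):
    """Build the pickle filename for a CMap name, dropping NUL bytes first.

    Hypothetical helper for illustration; not part of pdfminer.
    """
    # Assumption: '\x00' characters in the name are padding from the PDF
    # and are safe to remove before the name is used in a file path.
    clean = name.replace('\0', '')
    return '%s.pickle.gz' % clean


def candidate_paths(name, directories):
    """Return the paths that would be searched for the given CMap name."""
    filename = cmap_filename(name)
    return [os.path.join(directory, filename) for directory in directories]


if __name__ == '__main__':
    # The name reported above, with two trailing NUL bytes.
    print(repr(cmap_filename('to-unicode-PDFXC30-Identity\x00\x00')))
    # prints 'to-unicode-PDFXC30-Identity.pickle.gz'
```

Stripping the bytes before the filename is formatted keeps them out of every path that is later passed to os.path.join or open, which is where the TypeError is raised.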
@euske Is there a way to make a permanent fix for this?
A fork of the pdfminer.six repository has been created at https://github.com/strideai/pdfminer.six. This issue has been fixed in the fork, and we will be maintaining the forked repository from now on.
Hi @tataganesh, I tested the fork and it still fails. Attached file: simple1.pdf