I am trying to convert the following PDF to text: http://www.kabupro.jp/edp/20140529/S1001UPO.pdf

I am using the following command: pdf2txt.py -o text.txt S1001UPO.pdf

The document is encrypted, so I remove the encryption first; however, even after doing this I get the error below.

I suspect the issue is related to "TypeError: must be encoded string without NULL bytes, not str", for which this question seems to offer a solution: http://stackoverflow.com/questions/18265084/typeerror-must-be-string-without-null-bytes-not-str

Could someone point me to a workaround? Thank you!

    Traceback (most recent call last):
      File "/Users/JB1/anaconda/bin/pdf2txt.py", line 115, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/Users/JB1/anaconda/bin/pdf2txt.py", line 109, in main
        interpreter.process_page(page)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 833, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 844, in render_contents
        self.init_resources(resources)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 348, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 196, in get_font
        font = self.get_font(None, subspec)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 187, in get_font
        font = PDFCIDFont(self, spec)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdffont.py", line 668, in __init__
        self.unicode_map = CMapDB.get_unicode_map(self.cidcoding, self.cmap.is_vertical())
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/cmapdb.py", line 276, in get_unicode_map
        data = klass._load_data('to-unicode-%s' % name)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/cmapdb.py", line 247, in _load_data
        if os.path.exists(path):
      File "/Users/JB1/anaconda/lib/python2.7/genericpath.py", line 18, in exists
        os.stat(path)
    TypeError: must be encoded string without NULL bytes, not str
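The last frame of the traceback shows os.stat() rejecting the path. As a minimal sketch of that behaviour outside pdfminer, the snippet below passes a path containing NUL characters to os.stat(). The helper name stat_rejects is hypothetical, and the example path is made up for illustration. Python 2.7 raises TypeError here, as in the traceback; Python 3 raises ValueError instead, so the sketch catches both.

```python
import os


def stat_rejects(path):
    """Return the name of the exception os.stat() raises for an invalid
    path string, or None if the string itself is acceptable."""
    try:
        os.stat(path)
    except (TypeError, ValueError) as exc:
        # Raised when the path contains an embedded NUL character.
        return type(exc).__name__
    except OSError:
        # The string is a valid path; the file simply does not exist.
        return None
    return None


# Hypothetical path with two embedded NUL characters.
print(stat_rejects('cmap/to-unicode-Example\x00\x00.pickle.gz'))
# An ordinary missing file is not rejected this way.
print(stat_rejects('cmap/no-such-file.pickle.gz'))
```

A path that is merely missing does not trigger the error; only the embedded NUL characters do.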

Apr 04 '15 07:04 JSB97

@JSB97 I have also encountered the same error. The problematic snippet in cmapdb.py seems to be:

    def _load_data(klass, name):
        filename = '%s.pickle.gz' % name
        if klass.debug:
            print >>sys.stderr, 'loading:', name
        cmap_paths = (os.environ.get('CMAP_PATH', '/usr/share/pdfminer/'),
                      os.path.join(os.path.dirname(__file__), 'cmap'),)
        for directory in cmap_paths:
            path = os.path.join(directory, filename)

Printing the variable "filename" gives me: to-unicode-PDFXC30-Identity.pickle.gz

Printing "repr(filename)" yields: 'to-unicode-PDFXC30-Identity\x00\x00.pickle.gz'

Apparently, these \x00 characters are causing the issue. They are invisible in the plain print output, which is why the two results look different. One fix that solved this for me was adding the following line right after "filename" is assigned:

    filename = filename.replace('\0', '')

I am not sure what is causing the NUL characters to appear in the name, though. @euske, is there a way to make a permanent fix for this?
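The workaround described above can be sketched as a small standalone helper. The function name cmap_filename is hypothetical and is not part of pdfminer; it only mirrors the filename construction in _load_data and strips the NUL characters. Why the CMap name contains them is not established in this thread, so this treats the symptom rather than the cause.

```python
def cmap_filename(name):
    """Build the pickle filename for a CMap name, dropping NUL characters."""
    # Remove embedded NUL characters so the result is a valid path component.
    cleaned = name.replace('\0', '')
    return '%s.pickle.gz' % cleaned


# The CMap name reported in this thread, with its two trailing NUL characters.
raw = 'to-unicode-PDFXC30-Identity\x00\x00'

print(repr('%s.pickle.gz' % raw))   # 'to-unicode-PDFXC30-Identity\x00\x00.pickle.gz'
print(repr(cmap_filename(raw)))     # 'to-unicode-PDFXC30-Identity.pickle.gz'
```

Using repr() rather than a plain print makes the difference between the two filenames visible.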

Aug 17 '17 05:08 tataganesh

A fork of pdfminer.six has been created at https://github.com/strideai/pdfminer.six. This issue has been fixed in the fork, and we will be maintaining the forked repository from now on.

Nov 07 '17 05:11 tataganesh

Hi @tataganesh, I tested it and it still fails. simple1.pdf

Apr 28 '23 13:04 softboy99

pdfminer

Problem converting pdf to txt with pdf2txt.py
