pdfminer html or xml converter fails with TypeError: write() argument must be str, not bytes

Open Prasaddiwalkar opened this issue 5 years ago • 7 comments

Running python pdf2txt.py -t xml -o output.xml -d %pdffilepath% fails with the following error:

Traceback (most recent call last):
  File "pdf2text.py", line 113, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "pdf2text.py", line 92, in main
    device = XMLConverter(rsrcmgr, outfp, laparams=laparams, imagewriter=imagewriter, stripcontrol=stripcontrol)
  File "C:\apps\Python37\lib\site-packages\pdfminer\converter.py", line 442, in __init__
    self.write_header()
  File "C:\apps\Python37\lib\site-packages\pdfminer\converter.py", line 453, in write_header
    self.write('<?xml version="1.0" encoding="%s" ?>\n' % self.codec)
  File "C:\apps\Python37\lib\site-packages\pdfminer\converter.py", line 448, in write
    self.outfp.write(text)
TypeError: write() argument must be str, not bytes
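
For readers hitting the same error: the message suggests that the converter encodes its text to bytes before writing, while the output file was opened in text mode. The sketch below reproduces that mismatch with the standard library only. The helper write_encoded is a hypothetical stand-in for the converter's write step, not pdfminer's actual code, and the in-memory streams stand in for files opened with "w" and "wb".

```python
import io


def write_encoded(outfp, text, codec="utf-8"):
    """Hypothetical stand-in: encode the text, then write the resulting bytes."""
    outfp.write(text.encode(codec))


header = '<?xml version="1.0" encoding="utf-8" ?>\n'

# A text-mode stream (comparable to open(path, "w")) rejects bytes.
text_stream = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
try:
    write_encoded(text_stream, header)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = exc

# A binary stream (comparable to open(path, "wb")) accepts the encoded bytes.
binary_stream = io.BytesIO()
write_encoded(binary_stream, header)
```

If this is indeed the cause, opening the output file in binary mode in the calling script may avoid the error, though that has not been confirmed in this thread.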

Nov 21 '19 05:11 Prasaddiwalkar

Likely related - dumppdf.py also fails similarly:

Traceback (most recent call last):
  File "/home/wynand/.virtualenvs/pdf/bin/dumppdf.py", line 272, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/home/wynand/.virtualenvs/pdf/bin/dumppdf.py", line 269, in main
    dumpall=dumpall, mode=mode, extractdir=extractdir)
  File "/home/wynand/.virtualenvs/pdf/bin/dumppdf.py", line 222, in dumppdf
    dumptrailers(outfp, doc)
  File "/home/wynand/.virtualenvs/pdf/bin/dumppdf.py", line 95, in dumptrailers
    out.write('<trailer>\n')
TypeError: a bytes-like object is required, not 'str'
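
This is the mirror image of the first error: here a str is written to a stream that only accepts bytes. A minimal sketch of the situation, assuming the output was opened in binary mode; dump_trailer is a hypothetical stand-in for the failing function, not the real dumppdf.py code. Wrapping the binary stream in io.TextIOWrapper is one possible workaround, not a confirmed fix.

```python
import io


def dump_trailer(out):
    """Hypothetical stand-in: writes str markup, as the traceback shows."""
    out.write('<trailer>\n')
    out.write('</trailer>\n')


# A binary stream (comparable to open(path, "wb")) rejects str.
raw = io.BytesIO()
try:
    dump_trailer(raw)
    binary_error = None
except TypeError as exc:
    binary_error = exc

# Wrapping the binary stream lets str writes be encoded on the way out.
wrapped = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
dump_trailer(wrapped)
wrapped.flush()  # push buffered text down into the underlying byte stream
```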

Nov 21 '19 05:11 6A61736F6E206E61646572

Yes, I observed the same failure with dumppdf.py as well:

Traceback (most recent call last):
  File "dumppdf.py", line 272, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "dumppdf.py", line 269, in main
    dumpall=dumpall, mode=mode, extractdir=extractdir)
  File "dumppdf.py", line 222, in dumppdf
    dumptrailers(outfp, doc)
  File "dumppdf.py", line 95, in dumptrailers
    out.write('<trailer>\n')
TypeError: a bytes-like object is required, not 'str'

Nov 21 '19 05:11 Prasaddiwalkar

It also fails with the sample PDFs provided by the repo, so it's not an issue with our files.

Nov 21 '19 05:11 6A61736F6E206E61646572

pdfminer.six works, going to use that for now.

Nov 21 '19 05:11 6A61736F6E206E61646572

Yes, pdfminer.six works for me as well, but it gives me a node for each character.

I expected it to give me a text node for each word or each line.

Nov 21 '19 07:11 Prasaddiwalkar

Also, pdfminer.six does not preserve the order of the text from the PDF file in either the text or the XML output.

Nov 21 '19 07:11 Prasaddiwalkar

Any update here?

Apr 10 '24 15:04 wvanrensburg
