pdfminer.six Getting import error

ImportError Traceback (most recent call last)
in <cell line: 1>()
----> 1 from pdfminer.high_level import extract_text
      2 import pdfminer.high_level

/usr/local/lib/python3.10/dist-packages/pdfminer/high_level.py in
      6 from typing import Any, BinaryIO, Container, Iterator, Optional, cast
      7
----> 8 from .converter import (
      9     XMLConverter,
     10     HTMLConverter,

ImportError: cannot import name 'HOCRConverter' from 'pdfminer.converter' (/usr/local/lib/python3.10/dist-packages/pdfminer/converter.py)

Apr 28 '23 10:04 ShubhamGupta2505

Having the same issue

from pdfminer.high_level import extract_text
extract_text(file_path)

Got an error: ImportError: cannot import name 'HOCRConverter' from 'pdfminer.converter' (/usr/local/lib/python3.10/site-packages/pdfminer/converter.py)

Jul 03 '23 11:07 skitsanos

same here

Jul 21 '23 07:07 cristicretu

Same here

Jul 26 '23 16:07 Vini-1234

Same here

Jul 27 '23 01:07 jlrodriguezalvarado

same here after installing: pdfminer unstructured

Jul 27 '23 12:07 amitkalo

same

Aug 07 '23 16:08 leonhma

Solved through:

pip3 uninstall pdfminer
pip3 uninstall pdfminer-six
pip3 install pdfminer-six
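
A possible explanation for why the reinstall helps, offered as an assumption rather than something confirmed in this thread: the legacy `pdfminer` distribution and `pdfminer.six` both install a top-level `pdfminer` package, so having both (or leftovers from an older release) can leave a `high_level.py` that expects `HOCRConverter` next to a `converter.py` that does not define it. The sketch below uses only the standard library to list which of the two distributions are installed; the helper names `installed_version` and `find_pdfminer_dists` are made up for illustration.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def find_pdfminer_dists(names=("pdfminer", "pdfminer.six")):
    """Map each installed distribution name to its version string."""
    found = {}
    for name in names:
        version = installed_version(name)
        if version is not None:
            found[name] = version
    return found


if __name__ == "__main__":
    dists = find_pdfminer_dists()
    print(dists)
    if len(dists) > 1:
        # Assumption: both distributions write into the same 'pdfminer' directory.
        print("Both distributions are installed and may overwrite each other's files.")
```

If both names show up, uninstalling both and reinstalling only `pdfminer.six`, as described above, is the fix reported in this thread.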

Aug 07 '23 18:08 TanAidan

Solved through:

pip3 uninstall pdfminer
pip3 uninstall pdfminer-six
pip3 install pdfminer-six

import io

from pdfminer.converter import TextConverter, HTMLConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

pdf_path = "example.pdf"

i_f = open(pdf_path, 'rb')
resMgr = PDFResourceManager()
retData = io.StringIO()
codec = 'utf-8'

TxtConverter = HTMLConverter(resMgr,retData, laparams= LAParams(), codec = codec)

interpreter = PDFPageInterpreter(resMgr,TxtConverter)
for page in PDFPage.get_pages(i_f):
    interpreter.process_page(page)

html_tags = retData.getvalue()
print(html_tags)

ValueError                                Traceback (most recent call last)
Cell In[23], line 17
     14 retData = io.StringIO()
     15 codec = 'utf-8'
---> 17 TxtConverter = HTMLConverter(resMgr,retData, laparams= LAParams(), codec = codec)
     19 interpreter = PDFPageInterpreter(resMgr,TxtConverter)
     20 for page in PDFPage.get_pages(i_f):

File ~/anaconda3/lib/python3.10/site-packages/pdfminer/converter.py:393, in HTMLConverter.__init__(self, rsrcmgr, outfp, codec, pageno, laparams, scale, fontscale, layoutmode, showpageno, pagemargin, imagewriter, debug, rect_colors, text_colors)
    391 # write() assumes a codec for binary I/O, or no codec for text I/O.
    392 if self.outfp_binary == (not self.codec):
--> 393     raise ValueError("Codec is required for a binary I/O output")
    395 if text_colors is None:
    396     text_colors = {"char": "black"}

ValueError: Codec is required for a binary I/O output

Aug 08 '23 12:08 chiragsanghvi1

Solved :)

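Replace io.StringIO with io.BytesIO, because the converter writes encoded bytes when a codec is given:

from io import BytesIO
....
retData = BytesIO()  # binary buffer, matches codec = 'utf-8'
....
html_tags = retData.getvalue().decode(codec)  # getvalue() now returns bytes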
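
The error comes from the check shown in the traceback: `HTMLConverter` wants a codec for a binary buffer and no codec for a text buffer. A minimal sketch of both valid pairings follows; `make_buffer` and `pdf_to_html` are hypothetical helper names, and passing `codec=None` together with `io.StringIO` is my reading of that check rather than something tested in this thread.

```python
import io


def make_buffer(binary):
    """Return an (output buffer, codec) pair that passes the converter's check."""
    if binary:
        return io.BytesIO(), "utf-8"   # binary I/O needs a codec
    return io.StringIO(), None         # text I/O must not be given a codec


def pdf_to_html(pdf_path, binary=True):
    """Convert a PDF to an HTML string (requires pdfminer.six)."""
    # Imported lazily so make_buffer() can be used without pdfminer installed.
    from pdfminer.converter import HTMLConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    out, codec = make_buffer(binary)
    rsrcmgr = PDFResourceManager()
    device = HTMLConverter(rsrcmgr, out, codec=codec, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(pdf_path, "rb") as fp:   # file is closed even if parsing fails
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()  # lets the converter finish writing its output
    data = out.getvalue()
    return data.decode(codec) if binary else data
```

Usage would look like `html_tags = pdf_to_html("example.pdf")`; passing `binary=False` takes the `io.StringIO` route instead.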

Aug 08 '23 14:08 chiragsanghvi1

Ran into the same issue but couldn't fix it. Finally had to go back to the previous version [20220524].

Nov 01 '23 16:11 mfonolleda

pip3 uninstall pdfminer
pip3 uninstall pdfminer-six
pip3 install pdfminer-six

If that doesn't work, try restarting your computer. That fixed it for me, although I don't know why. Rebooting solves everything :)

Mar 05 '24 01:03 xuzijie1995
