pdfminer.six
Getting import error
ImportError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pdfminer/high_level.py in
ImportError: cannot import name 'HOCRConverter' from 'pdfminer.converter' (/usr/local/lib/python3.10/dist-packages/pdfminer/converter.py)
Having the same issue
from pdfminer.high_level import extract_text
extract_text(file_path)
Got an error:
ImportError: cannot import name 'HOCRConverter' from 'pdfminer.converter' (/usr/local/lib/python3.10/site-packages/pdfminer/converter.py)
same here
Same here
Same here
Same here, after installing pdfminer and unstructured.
same
Solved by uninstalling both packages and reinstalling pdfminer-six:
pip3 uninstall pdfminer
pip3 uninstall pdfminer-six
pip3 install pdfminer-six
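A likely reason the reinstall helps, as far as I can tell: the older pdfminer distribution and pdfminer.six both install a top-level pdfminer package, so with both present the newer high_level.py can end up next to an older converter.py that has no HOCRConverter. Here is a minimal sketch for checking which of the two is installed before uninstalling anything. The helper name installed_versions is made up for this example; only the standard-library importlib.metadata is used.

```python
from importlib import metadata

# Distribution names as published on PyPI. Both ship a top-level "pdfminer"
# package, so having both installed can leave mixed files on disk.
CANDIDATES = ("pdfminer", "pdfminer.six")


def installed_versions(names=CANDIDATES):
    """Return {distribution name: version} for the names that are installed.

    Hypothetical helper for illustration, not part of pdfminer.six.
    """
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in this environment; skip it.
            continue
    return found


if __name__ == "__main__":
    found = installed_versions()
    print(found)
    if len(found) > 1:
        print("Both distributions are installed; uninstall both, "
              "then reinstall pdfminer.six.")
```

If both names show up, that matches the situation where the uninstall/reinstall sequence above fixes the import.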
import io
from pdfminer.converter import TextConverter, HTMLConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
pdf_path = "example.pdf"
i_f = open(pdf_path, 'rb')
resMgr = PDFResourceManager()
retData = io.StringIO()
codec = 'utf-8'
TxtConverter = HTMLConverter(resMgr,retData, laparams= LAParams(), codec = codec)
interpreter = PDFPageInterpreter(resMgr,TxtConverter)
for page in PDFPage.get_pages(i_f):
    interpreter.process_page(page)
html_tags = retData.getvalue()
print(html_tags)
ValueError Traceback (most recent call last)
Cell In[23], line 17
14 retData = io.StringIO()
15 codec = 'utf-8'
---> 17 TxtConverter = HTMLConverter(resMgr,retData, laparams= LAParams(), codec = codec)
19 interpreter = PDFPageInterpreter(resMgr,TxtConverter)
20 for page in PDFPage.get_pages(i_f):
File ~/anaconda3/lib/python3.10/site-packages/pdfminer/converter.py:393, in HTMLConverter.__init__(self, rsrcmgr, outfp, codec, pageno, laparams, scale, fontscale, layoutmode, showpageno, pagemargin, imagewriter, debug, rect_colors, text_colors)
391 # write() assumes a codec for binary I/O, or no codec for text I/O.
392 if self.outfp_binary == (not self.codec):
--> 393 raise ValueError("Codec is required for a binary I/O output")
395 if text_colors is None:
396 text_colors = {"char": "black"}
ValueError: Codec is required for a binary I/O output
Solved :)
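Replace io.StringIO with io.BytesIO:
from io import BytesIO
....
retData = BytesIO()
....
Note that retData.getvalue() then returns bytes, so decode it (for example with .decode(codec)) before printing.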
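For anyone hitting the same ValueError: judging by the check quoted in the traceback, the converter raises whenever the stream type and the codec disagree, so a text stream such as io.StringIO needs codec=None, and a binary stream such as io.BytesIO needs a codec. Below is a minimal sketch of that rule. The names codec_for and pdf_to_html are made up for this example, and the pdfminer imports are done inside the function only so the small helper can be tried on its own.

```python
import io


def codec_for(outfp, default="utf-8"):
    """Return None for text streams and `default` for binary streams.

    Hypothetical helper for illustration, not part of pdfminer.six.
    """
    return None if isinstance(outfp, io.TextIOBase) else default


def pdf_to_html(pdf_path, binary=True):
    """Convert a PDF to an HTML string using a binary or text buffer."""
    # Imported lazily so codec_for() works even without pdfminer.six installed.
    from pdfminer.converter import HTMLConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    out = io.BytesIO() if binary else io.StringIO()
    codec = codec_for(out)  # "utf-8" for BytesIO, None for StringIO

    rsrcmgr = PDFResourceManager()
    converter = HTMLConverter(rsrcmgr, out, codec=codec, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, converter)

    with open(pdf_path, "rb") as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    converter.close()  # writes the closing HTML tags

    data = out.getvalue()
    # A binary buffer holds encoded bytes; decode them back to str.
    return data.decode(codec) if binary else data
```

Calling pdf_to_html("example.pdf") should behave like the BytesIO fix above, while pdf_to_html("example.pdf", binary=False) keeps the StringIO buffer and drops the codec instead.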
Ran into the same issue but couldn't fix it. Finally had to go back to the previous version [20220524].
pip3 uninstall pdfminer
pip3 uninstall pdfminer-six
pip3 install pdfminer-six
If that doesn't work, try restarting your computer. That worked for me, even though I don't know why. Rebooting solves everything :)