tesserocr
OSX: RuntimeError: Failed to init API, possibly an invalid tessdata path
On OSX, I'm getting this error when using another language. Here is all the info I could gather. Do you have any idea why this fails?
- PIP list
Pillow (5.1.0)
tesserocr (2.2.2)
- tesseract --version
# installed by brew install tesseract --with-all-languages
tesseract 3.05.01
leptonica-1.75.3
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
- test.py and output
import tesserocr
from PIL import Image
print(tesserocr.tesseract_version())
print(tesserocr.get_languages())
image = Image.open('DSCF1896.jpg')
print(tesserocr.image_to_text(image, lang='kor'))
- output of test.py
tesseract 3.05.01
leptonica-1.75.3
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
('/usr/local/Cellar/tesseract/3.05.01/share/tessdata/', ['ori', 'por', 'srp', 'hin', 'chi_sim', 'spa', 'uzb_cyrl', 'mar', 'swa', 'ces', 'urd', 'nep', 'cat', 'mya', 'lit', 'dan', 'mlt', 'enm', 'bod', 'tir', 'tgl', 'tha', 'fas', 'hrv', 'ukr', 'lao', 'ben', 'eus', 'eng', 'dzo', 'nld', 'vie', 'ita', 'kir', 'pus', 'msa', 'heb', 'slv', 'kaz', 'fin', 'yid', 'deu', 'bul', 'khm', 'ell', 'cym', 'kor', 'slk_frak', 'lav', 'mkd', 'glg', 'sin', 'syr', 'rus', 'kat', 'frk', 'kur', 'bos', 'ind', 'swe', 'est', 'iku', 'sqi', 'nor', 'pol', 'tam', 'mal', 'slk', 'jav', 'srp_latn', 'osd', 'afr', 'hat', 'gle', 'ron', 'kan', 'uig', 'lat', 'ita_old', 'frm', 'equ', 'tgk', 'kat_old', 'spa_old', 'uzb', 'dan_frak', 'hun', 'aze_cyrl', 'isl', 'grc', 'aze', 'asm', 'pan', 'epo', 'chi_tra', 'tel', 'deu_frak', 'amh', 'chr', 'guj', 'ara', 'san', 'fra', 'tur', 'jpn', 'ceb', 'bel'])
Traceback (most recent call last):
File "test.py", line 13, in <module>
print(tesserocr.image_to_text(image, lang='kor', path=cpath))
File "tesserocr.pyx", line 2400, in tesserocr.image_to_text
RuntimeError: Failed to init API, possibly an invalid tessdata path: /usr/local/Cellar/tesseract/3.05.01/share/
- training data
ls /usr/local/Cellar/tesseract/3.05.01/share/tessdata/
afr.traineddata dan_frak.traineddata fra.cube.word-freq ita.cube.nn nep.traineddata spa.cube.params
amh.traineddata deu.traineddata fra.tesseract_cube.nn ita.cube.params nld.traineddata spa.cube.size
ara.cube.bigrams deu_frak.traineddata fra.traineddata ita.cube.size nor.traineddata spa.cube.word-freq
ara.cube.fold dzo.traineddata frk.traineddata ita.cube.word-freq ori.traineddata spa.traineddata
ara.cube.lm ell.traineddata frm.traineddata ita.tesseract_cube.nn osd.traineddata spa_old.traineddata
ara.cube.nn eng.cube.bigrams gle.traineddata ita.traineddata pan.traineddata sqi.traineddata
ara.cube.params eng.cube.fold glg.traineddata ita_old.traineddata pdf.ttf srp.traineddata
ara.cube.size eng.cube.lm grc.traineddata jav.traineddata pol.traineddata srp_latn.traineddata
ara.cube.word-freq eng.cube.nn guj.traineddata jpn.traineddata por.traineddata swa.traineddata
ara.traineddata eng.cube.params hat.traineddata kan.traineddata pus.traineddata swe.traineddata
asm.traineddata eng.cube.size heb.traineddata kat.traineddata ron.traineddata syr.traineddata
aze.traineddata eng.cube.word-freq hin.cube.bigrams kat_old.traineddata rus.cube.fold tam.traineddata
aze_cyrl.traineddata eng.tesseract_cube.nn hin.cube.fold kaz.traineddata rus.cube.lm tel.traineddata
bel.traineddata eng.traineddata hin.cube.lm khm.traineddata rus.cube.nn tessconfigs
ben.traineddata enm.traineddata hin.cube.nn kir.traineddata rus.cube.params tgk.traineddata
bod.traineddata epo.traineddata hin.cube.params kor.traineddata rus.cube.size tgl.traineddata
bos.traineddata equ.traineddata hin.cube.word-freq kur.traineddata rus.cube.word-freq tha.traineddata
bul.traineddata est.traineddata hin.tesseract_cube.nn lao.traineddata rus.traineddata tir.traineddata
cat.traineddata eus.traineddata hin.traineddata lat.traineddata san.traineddata tur.traineddata
ceb.traineddata fas.traineddata hrv.traineddata lav.traineddata sin.traineddata uig.traineddata
ces.traineddata fin.traineddata hun.traineddata lit.traineddata slk.traineddata ukr.traineddata
chi_sim.traineddata fra.cube.bigrams iku.traineddata mal.traineddata slk_frak.traineddata urd.traineddata
chi_tra.traineddata fra.cube.fold ind.traineddata mar.traineddata slv.traineddata uzb.traineddata
chr.traineddata fra.cube.lm isl.traineddata mkd.traineddata spa.cube.bigrams uzb_cyrl.traineddata
configs fra.cube.nn ita.cube.bigrams mlt.traineddata spa.cube.fold vie.traineddata
cym.traineddata fra.cube.params ita.cube.fold msa.traineddata spa.cube.lm yid.traineddata
dan.traineddata fra.cube.size ita.cube.lm mya.traineddata spa.cube.nn
@Gatsby-Lee I did something like this on my Win10 machine and it worked. I guess it may also work on Linux, good luck!
vim /etc/profile
export TESSDATA_PREFIX="/usr/local/Cellar/tesseract/3.05.01/share/"  # add at the end of /etc/profile
source /etc/profile  # refresh environment variables
@moucmou Thank you for your suggestion. I tried again after setting the environment variable as you did, TESSDATA_PREFIX="/usr/local/Cellar/tesseract/3.05.01/share/". However, I still see the same error.
I will try again with your approach.
There seems to be a problem with the tessdata path. In your logs, the output of get_languages() shows the tessdata path as /usr/local/Cellar/tesseract/3.05.01/share/tessdata/ while the RuntimeError shows it as /usr/local/Cellar/tesseract/3.05.01/share/. Can you try setting your TESSDATA_PREFIX environment variable to /usr/local/Cellar/tesseract/3.05.01/share/tessdata/ instead? You should also be able to initialize the API with that path using:
api = tesserocr.PyTessBaseAPI(path='/usr/local/Cellar/tesseract/3.05.01/share/tessdata')
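Before changing TESSDATA_PREFIX, it may help to confirm which directory actually holds the .traineddata file for the requested language. Below is a minimal sketch that uses only the standard library. The helper names list_traineddata_langs and has_language are made up for illustration and are not part of tesserocr, and the two candidate paths are simply the ones quoted in the logs above.

```python
import os


def list_traineddata_langs(tessdata_dir):
    """Return the sorted language codes that have a <lang>.traineddata file."""
    suffix = ".traineddata"
    if not os.path.isdir(tessdata_dir):
        return []
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


def has_language(tessdata_dir, lang):
    """True if tessdata_dir contains <lang>.traineddata (hypothetical helper)."""
    return lang in list_traineddata_langs(tessdata_dir)


if __name__ == "__main__":
    # Candidate directories taken from the logs in this thread.
    share = "/usr/local/Cellar/tesseract/3.05.01/share/"
    for candidate in (share, os.path.join(share, "tessdata")):
        print(candidate, has_language(candidate, "kor"))
```

If neither candidate reports the language, the problem is likely missing data rather than the path itself. If the file is present and initialization still fails, as happens later in this thread, the cause is probably elsewhere.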
@sirfz
I tried both approaches you mentioned.
- setting TESSDATA_PREFIX
- initializing PyTessBaseAPI with path and lang
However, neither of them has worked so far. Here is the output I got.
python test.py
tesseract 3.05.01
leptonica-1.75.3
libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
('/usr/local/Cellar/tesseract/3.05.01/share/tessdata/', ['ori', 'por', 'srp', 'hin', 'chi_sim', 'spa', 'uzb_cyrl', 'mar', 'swa', 'ces', 'urd', 'nep', 'cat', 'mya', 'lit', 'dan', 'mlt', 'enm', 'bod', 'tir', 'tgl', 'tha', 'fas', 'hrv', 'ukr', 'lao', 'ben', 'eus', 'eng', 'dzo', 'nld', 'vie', 'ita', 'kir', 'pus', 'msa', 'heb', 'slv', 'kaz', 'fin', 'yid', 'deu', 'bul', 'khm', 'ell', 'cym', 'kor', 'slk_frak', 'lav', 'mkd', 'glg', 'sin', 'syr', 'rus', 'kat', 'frk', 'kur', 'bos', 'ind', 'swe', 'est', 'iku', 'sqi', 'nor', 'pol', 'tam', 'mal', 'slk', 'jav', 'srp_latn', 'osd', 'afr', 'hat', 'gle', 'ron', 'kan', 'uig', 'lat', 'ita_old', 'frm', 'equ', 'tgk', 'kat_old', 'spa_old', 'uzb', 'dan_frak', 'hun', 'aze_cyrl', 'isl', 'grc', 'aze', 'asm', 'pan', 'epo', 'chi_tra', 'tel', 'deu_frak', 'amh', 'chr', 'guj', 'ara', 'san', 'fra', 'tur', 'jpn', 'ceb', 'bel'])
Traceback (most recent call last):
File "test.py", line 18, in <module>
api = tesserocr.PyTessBaseAPI(lang='kor', path='/usr/local/Cellar/tesseract/3.05.01/share/tessdata/')
File "tesserocr.pyx", line 1144, in tesserocr.PyTessBaseAPI.__cinit__
File "tesserocr.pyx", line 1157, in tesserocr.PyTessBaseAPI._init_api
RuntimeError: Failed to init API, possibly an invalid tessdata path: /usr/local/Cellar/tesseract/3.05.01/share/tessdata/
Here is the test code I used.
import tesserocr
from PIL import Image
image = Image.open('test.jpg')
api = tesserocr.PyTessBaseAPI(lang='kor', path='/usr/local/Cellar/tesseract/3.05.01/share/tessdata/')
api.SetImage(image)
print(api.GetUTF8Text())
api.End()
Thank you for your suggestion.
Have you tried language files other than kor?
@sirfz
Interesting results.
When I tried a Latin-script language, the error was different from the one I get with languages like jpn.
- with ita
Traceback (most recent call last):
File "test.py", line 24, in <module>
print(api.GetUTF8Text())
File "tesserocr.pyx", line 2105, in tesserocr.PyTessBaseAPI.GetUTF8Text
File "tesserocr.pyx", line 311, in tesserocr._free_str
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 27: invalid continuation byte
- with jpn
Traceback (most recent call last):
File "test.py", line 22, in <module>
api = tesserocr.PyTessBaseAPI(lang='jpn', path='/usr/local/Cellar/tesseract/3.05.01/share/tessdata/')
File "tesserocr.pyx", line 1144, in tesserocr.PyTessBaseAPI.__cinit__
File "tesserocr.pyx", line 1157, in tesserocr.PyTessBaseAPI._init_api
RuntimeError: Failed to init API, possibly an invalid tessdata path: /usr/local/Cellar/tesseract/3.05.01/share/tessdata/
Where did you download your tessdata files from? I see 3.05.01 in your path but tessdata for tesseract v3.05.01 is released under v3.04.00 (check releases). Can you try to download that release and test again? (in case that's not what you're already using).
@sirfz Sorry for the late response. I think the tessdata was downloaded by brew. I will try again and post the output. Thank you.
I have the same problem. Have you solved it? Thank you.
RuntimeError: Failed to init API, possibly an invalid tessdata path: C:\Tesseract-OCR\tessdata
I ran into the same issue, and I found that running in the Eclipse environment works. What is the difference between running in Eclipse and running in the terminal?
Test 1: lang = 'eng' works, but lang = 'chi_sim' hits this issue, while it still works in the Eclipse environment. How?
Test 2: I renamed eng.traineddata to chi_sim.traineddata and tested again, and it works. So maybe this is caused by the downloaded chi_sim.traineddata? How can I fix it?
Finally, I found a solution for Python:
add the code below to your Python code.
import locale
locale.setlocale(locale.LC_ALL, "C")
Also, if you have two or more versions of tesseract installed, you need to set 'TESSDATA_PREFIX' to the proper one.
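The locale workaround above can also be scoped so that the previous locale is restored afterwards. This is only a sketch: c_locale is a hypothetical helper name, not something tesserocr provides, and whether restoring the locale is safe for your use of tesseract is an assumption you should verify.

```python
import contextlib
import locale


@contextlib.contextmanager
def c_locale():
    """Temporarily set every locale category to "C", then restore the old value."""
    previous = locale.setlocale(locale.LC_ALL)  # no second argument: query only
    locale.setlocale(locale.LC_ALL, "C")
    try:
        yield
    finally:
        locale.setlocale(locale.LC_ALL, previous)


if __name__ == "__main__":
    with c_locale():
        # Inside the block the process runs under the "C" locale.
        print(locale.setlocale(locale.LC_ALL))
```

The idea is to put the tesserocr calls, for example tesserocr.image_to_text(image, lang='chi_sim'), inside the with block, so that the API is initialized while the "C" locale is active.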
Thank you, I've solved this problem the hard way: switch to the folder C:\Program Files\Python36\Lib\site-packages\tesserocr
import tesserocr
import os
from PIL import Image
os.chdir(r"C:\Program Files\Python36\Lib\site-packages\tesserocr")
image = Image.open('image.png')
print(tesserocr.image_to_text(image))
The tough part is that you have to switch the directory to tesserocr every time.
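If the directory switch is what makes it work for you, one way to avoid leaving the process in that directory is a small context manager that changes back afterwards. This is a sketch under that assumption; working_directory is a made-up helper name. Python 3.11 and later ship contextlib.chdir, which does the same thing.

```python
import contextlib
import os


@contextlib.contextmanager
def working_directory(path):
    """Change into path for the duration of the with block, then change back."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


if __name__ == "__main__":
    print("before:", os.getcwd())
    with working_directory(os.path.expanduser("~")):
        print("inside:", os.getcwd())
    print("after:", os.getcwd())
```

Note that a relative path such as 'image.png' is resolved against the current directory, so it is safer to open the image before entering the block or to pass an absolute path.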
@kenkuang @HYSWZW Thank you for sharing solution you used. I will try as well. :)
Hi, is it solved?
@kenkuang Very nice! It works, I fixed this problem!
Hi @Gatsby-Lee @HYSWZW. I tried what you described above, but I got an error. Can you help me?
- OS
win10
- PIP list
tesserocr 2.4.0
Pillow 5.4.1
- code
import tesserocr
from PIL import Image
import os
print(tesserocr.tesseract_version())
print(tesserocr.get_languages())
os.chdir(r"D:\Programs\Python36\Lib\site-packages\tesserocr")
image = Image.open('code2.jpg')
result = tesserocr.image_to_text(image)
print(result)
- output
tesseract 4.0.0
leptonica-1.76.0 (Jan 8 2019, 13:41:57) [MSC v.1900 LIB Release x64]
libgif 5.1.4 : libjpeg 9b : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
('D:\Programs\Python36/tessdata/', [])
Traceback (most recent call last):
File "D:/jupyternotebook/Cuiqingcai_scraw/python3_scraw/eightchapter/pic.py", line 9, in <module>
result = tesserocr.image_to_text(image)
File "tesserocr.pyx", line 2443, in tesserocr._tesserocr.image_to_text
RuntimeError: Failed to init API, possibly an invalid tessdata path: D:\Programs\Python36/tessdata/
Same error...