OSX: RuntimeError: Failed to init API, possibly an invalid tessdata path

Open Gatsby-Lee opened this issue 6 years ago • 16 comments

On OSX, I'm getting an error when using other languages. Here is all the info I could gather. Do you have any idea why this fails?

  • PIP list
Pillow (5.1.0)
tesserocr (2.2.2)
  • tesseract --version
# installed by brew install tesseract --with-all-languages
tesseract 3.05.01
 leptonica-1.75.3
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11
  • test.py and output
import tesserocr
from PIL import Image

print(tesserocr.tesseract_version())
print(tesserocr.get_languages())
image = Image.open('DSCF1896.jpg')
print(tesserocr.image_to_text(image, lang='kor'))
  • output of test.py
tesseract 3.05.01
 leptonica-1.75.3
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

('/usr/local/Cellar/tesseract/3.05.01/share/tessdata/', ['ori', 'por', 'srp', 'hin', 'chi_sim', 'spa', 'uzb_cyrl', 'mar', 'swa', 'ces', 'urd', 'nep', 'cat', 'mya', 'lit', 'dan', 'mlt', 'enm', 'bod', 'tir', 'tgl', 'tha', 'fas', 'hrv', 'ukr', 'lao', 'ben', 'eus', 'eng', 'dzo', 'nld', 'vie', 'ita', 'kir', 'pus', 'msa', 'heb', 'slv', 'kaz', 'fin', 'yid', 'deu', 'bul', 'khm', 'ell', 'cym', 'kor', 'slk_frak', 'lav', 'mkd', 'glg', 'sin', 'syr', 'rus', 'kat', 'frk', 'kur', 'bos', 'ind', 'swe', 'est', 'iku', 'sqi', 'nor', 'pol', 'tam', 'mal', 'slk', 'jav', 'srp_latn', 'osd', 'afr', 'hat', 'gle', 'ron', 'kan', 'uig', 'lat', 'ita_old', 'frm', 'equ', 'tgk', 'kat_old', 'spa_old', 'uzb', 'dan_frak', 'hun', 'aze_cyrl', 'isl', 'grc', 'aze', 'asm', 'pan', 'epo', 'chi_tra', 'tel', 'deu_frak', 'amh', 'chr', 'guj', 'ara', 'san', 'fra', 'tur', 'jpn', 'ceb', 'bel'])
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    print(tesserocr.image_to_text(image, lang='kor', path=cpath))
  File "tesserocr.pyx", line 2400, in tesserocr.image_to_text
RuntimeError: Failed to init API, possibly an invalid tessdata path: /usr/local/Cellar/tesseract/3.05.01/share/
  • training data
ls /usr/local/Cellar/tesseract/3.05.01/share/tessdata/    
afr.traineddata       dan_frak.traineddata  fra.cube.word-freq    ita.cube.nn           nep.traineddata       spa.cube.params
amh.traineddata       deu.traineddata       fra.tesseract_cube.nn ita.cube.params       nld.traineddata       spa.cube.size
ara.cube.bigrams      deu_frak.traineddata  fra.traineddata       ita.cube.size         nor.traineddata       spa.cube.word-freq
ara.cube.fold         dzo.traineddata       frk.traineddata       ita.cube.word-freq    ori.traineddata       spa.traineddata
ara.cube.lm           ell.traineddata       frm.traineddata       ita.tesseract_cube.nn osd.traineddata       spa_old.traineddata
ara.cube.nn           eng.cube.bigrams      gle.traineddata       ita.traineddata       pan.traineddata       sqi.traineddata
ara.cube.params       eng.cube.fold         glg.traineddata       ita_old.traineddata   pdf.ttf               srp.traineddata
ara.cube.size         eng.cube.lm           grc.traineddata       jav.traineddata       pol.traineddata       srp_latn.traineddata
ara.cube.word-freq    eng.cube.nn           guj.traineddata       jpn.traineddata       por.traineddata       swa.traineddata
ara.traineddata       eng.cube.params       hat.traineddata       kan.traineddata       pus.traineddata       swe.traineddata
asm.traineddata       eng.cube.size         heb.traineddata       kat.traineddata       ron.traineddata       syr.traineddata
aze.traineddata       eng.cube.word-freq    hin.cube.bigrams      kat_old.traineddata   rus.cube.fold         tam.traineddata
aze_cyrl.traineddata  eng.tesseract_cube.nn hin.cube.fold         kaz.traineddata       rus.cube.lm           tel.traineddata
bel.traineddata       eng.traineddata       hin.cube.lm           khm.traineddata       rus.cube.nn           tessconfigs
ben.traineddata       enm.traineddata       hin.cube.nn           kir.traineddata       rus.cube.params       tgk.traineddata
bod.traineddata       epo.traineddata       hin.cube.params       kor.traineddata       rus.cube.size         tgl.traineddata
bos.traineddata       equ.traineddata       hin.cube.word-freq    kur.traineddata       rus.cube.word-freq    tha.traineddata
bul.traineddata       est.traineddata       hin.tesseract_cube.nn lao.traineddata       rus.traineddata       tir.traineddata
cat.traineddata       eus.traineddata       hin.traineddata       lat.traineddata       san.traineddata       tur.traineddata
ceb.traineddata       fas.traineddata       hrv.traineddata       lav.traineddata       sin.traineddata       uig.traineddata
ces.traineddata       fin.traineddata       hun.traineddata       lit.traineddata       slk.traineddata       ukr.traineddata
chi_sim.traineddata   fra.cube.bigrams      iku.traineddata       mal.traineddata       slk_frak.traineddata  urd.traineddata
chi_tra.traineddata   fra.cube.fold         ind.traineddata       mar.traineddata       slv.traineddata       uzb.traineddata
chr.traineddata       fra.cube.lm           isl.traineddata       mkd.traineddata       spa.cube.bigrams      uzb_cyrl.traineddata
configs               fra.cube.nn           ita.cube.bigrams      mlt.traineddata       spa.cube.fold         vie.traineddata
cym.traineddata       fra.cube.params       ita.cube.fold         msa.traineddata       spa.cube.lm           yid.traineddata
dan.traineddata       fra.cube.size         ita.cube.lm           mya.traineddata       spa.cube.nn

Gatsby-Lee avatar Apr 05 '18 05:04 Gatsby-Lee

@Gatsby-Lee I did something like this on my Win10 machine and it worked; I guess it may also work on Linux. Good luck!

vim /etc/profile
export TESSDATA_PREFIX="/usr/local/Cellar/tesseract/3.05.01/share/"   # add at the end of /etc/profile
source /etc/profile   # refresh environment variables

moucmou avatar Apr 08 '18 07:04 moucmou

@moucmou Thank you for your suggestion. I tried again after setting the environment variable as you did: TESSDATA_PREFIX="/usr/local/Cellar/tesseract/3.05.01/share/". However, I still see the same error.

I will try again with your approach.

Gatsby-Lee avatar Apr 08 '18 22:04 Gatsby-Lee

There seems to be a problem with the tessdata path. In your logs, the output of get_languages() shows the tessdata path as /usr/local/Cellar/tesseract/3.05.01/share/tessdata/ while the RuntimeError shows it as /usr/local/Cellar/tesseract/3.05.01/share/. Can you try setting your TESSDATA_PREFIX environment variable to /usr/local/Cellar/tesseract/3.05.01/share/tessdata/ instead? You should also be able to initialize the API with that path using:

api = tesserocr.PyTessBaseAPI(path='/usr/local/Cellar/tesseract/3.05.01/share/tessdata')

sirfz avatar Apr 09 '18 13:04 sirfz
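
A quick way to rule out the simplest cause sirfz describes (a path that does not actually contain the requested language file) is to check the directory directly. This is only a sketch using the standard library; `list_traineddata` and `has_language` are hypothetical helper names, not part of tesserocr, and the path is the one printed by get_languages() earlier in this thread. Passing this check does not guarantee that the API will initialize, since a file can exist and still fail to load.

```python
import os


def list_traineddata(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    if not os.path.isdir(tessdata_dir):
        return []
    suffix = ".traineddata"
    return sorted(
        name[: -len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


def has_language(tessdata_dir, lang):
    """Return True if <tessdata_dir>/<lang>.traineddata exists as a regular file."""
    return os.path.isfile(os.path.join(tessdata_dir, lang + ".traineddata"))


if __name__ == "__main__":
    # Path taken from the get_languages() output reported in this thread.
    tessdata = "/usr/local/Cellar/tesseract/3.05.01/share/tessdata/"
    print(has_language(tessdata, "kor"))
    print(list_traineddata(tessdata))
```

If `has_language` returns False for the directory you pass as `path`, the error message is probably accurate; if it returns True, the cause likely lies elsewhere.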

@sirfz

I tried both approaches you mentioned.

  • setting TESSDATA_PREFIX
  • Init PyTessBaseAPI with path and lang

However, neither of them has worked so far. Here is the output I got.

python test.py 
tesseract 3.05.01
 leptonica-1.75.3
  libjpeg 9c : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11

('/usr/local/Cellar/tesseract/3.05.01/share/tessdata/', ['ori', 'por', 'srp', 'hin', 'chi_sim', 'spa', 'uzb_cyrl', 'mar', 'swa', 'ces', 'urd', 'nep', 'cat', 'mya', 'lit', 'dan', 'mlt', 'enm', 'bod', 'tir', 'tgl', 'tha', 'fas', 'hrv', 'ukr', 'lao', 'ben', 'eus', 'eng', 'dzo', 'nld', 'vie', 'ita', 'kir', 'pus', 'msa', 'heb', 'slv', 'kaz', 'fin', 'yid', 'deu', 'bul', 'khm', 'ell', 'cym', 'kor', 'slk_frak', 'lav', 'mkd', 'glg', 'sin', 'syr', 'rus', 'kat', 'frk', 'kur', 'bos', 'ind', 'swe', 'est', 'iku', 'sqi', 'nor', 'pol', 'tam', 'mal', 'slk', 'jav', 'srp_latn', 'osd', 'afr', 'hat', 'gle', 'ron', 'kan', 'uig', 'lat', 'ita_old', 'frm', 'equ', 'tgk', 'kat_old', 'spa_old', 'uzb', 'dan_frak', 'hun', 'aze_cyrl', 'isl', 'grc', 'aze', 'asm', 'pan', 'epo', 'chi_tra', 'tel', 'deu_frak', 'amh', 'chr', 'guj', 'ara', 'san', 'fra', 'tur', 'jpn', 'ceb', 'bel'])
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    api = tesserocr.PyTessBaseAPI(lang='kor', path='/usr/local/Cellar/tesseract/3.05.01/share/tessdata/')
  File "tesserocr.pyx", line 1144, in tesserocr.PyTessBaseAPI.__cinit__
  File "tesserocr.pyx", line 1157, in tesserocr.PyTessBaseAPI._init_api
RuntimeError: Failed to init API, possibly an invalid tessdata path: /usr/local/Cellar/tesseract/3.05.01/share/tessdata/

Here is the test code I used.

import tesserocr
from PIL import Image
image = Image.open('test.jpg')
api = tesserocr.PyTessBaseAPI(lang='kor', path='/usr/local/Cellar/tesseract/3.05.01/share/tessdata/')
api.SetImage(image)
print(api.GetUTF8Text())
api.End()

Thank you for your suggestion.

Gatsby-Lee avatar Apr 09 '18 15:04 Gatsby-Lee

Have you tried other language files than kor?

sirfz avatar Apr 09 '18 15:04 sirfz

@sirfz

Interesting results. When I tried a Latin-character language such as ita, the error was different from the one I get with languages like jpn.

  • with ita
Traceback (most recent call last):
  File "test.py", line 24, in <module>
    print(api.GetUTF8Text())
  File "tesserocr.pyx", line 2105, in tesserocr.PyTessBaseAPI.GetUTF8Text
  File "tesserocr.pyx", line 311, in tesserocr._free_str
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 27: invalid continuation byte
  • with jpn
Traceback (most recent call last):
  File "test.py", line 22, in <module>
    api = tesserocr.PyTessBaseAPI(lang='jpn', path='/usr/local/Cellar/tesseract/3.05.01/share/tessdata/')
  File "tesserocr.pyx", line 1144, in tesserocr.PyTessBaseAPI.__cinit__
  File "tesserocr.pyx", line 1157, in tesserocr.PyTessBaseAPI._init_api
RuntimeError: Failed to init API, possibly an invalid tessdata path: /usr/local/Cellar/tesseract/3.05.01/share/tessdata/

Gatsby-Lee avatar Apr 09 '18 15:04 Gatsby-Lee

Where did you download your tessdata files from? I see 3.05.01 in your path but tessdata for tesseract v3.05.01 is released under v3.04.00 (check releases). Can you try to download that release and test again? (in case that's not what you're already using).

sirfz avatar Apr 11 '18 14:04 sirfz

@sirfz Sorry for the late response. I think the tessdata was downloaded by brew. I will try again and post the output. Thank you.

Gatsby-Lee avatar Apr 16 '18 15:04 Gatsby-Lee

I have the same problem. Have you solved it? Thank you.

RuntimeError: Failed to init API, possibly an invalid tessdata path: C:\Tesseract-OCR\tessdata

HYSWZW avatar May 14 '18 06:05 HYSWZW

I ran into the same issue, and I found that the code runs fine in the Eclipse environment. What is the difference between running in Eclipse and running in the Terminal?

Test 1: lang = 'eng' is OK, but lang = 'chi_sim' hits this issue, while it is still OK in the Eclipse environment. How?

Test 2: I renamed eng.traineddata to chi_sim.traineddata and tested again. It was OK, so maybe this is caused by the downloaded chi_sim.traineddata? How can it be fixed?

Finally, I found a solution for Python. Add the code below to your Python code:

import locale
locale.setlocale(locale.LC_ALL, "C")

And if you have two or more versions of tesseract, you need to set 'TESSDATA_PREFIX' to the proper one.

kenkuang avatar May 16 '18 00:05 kenkuang
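
kenkuang's locale fix can be wrapped in a small helper so it always runs before the API is initialized. The following is a minimal sketch, not a confirmed recipe: `force_c_locale` and `ocr_image` are hypothetical names, and importing tesserocr only after the locale is set is a precaution assumed here rather than something established in this thread.

```python
import locale


def force_c_locale():
    """Set every locale category to "C" and return the resulting locale name."""
    return locale.setlocale(locale.LC_ALL, "C")


def ocr_image(image_path, lang="eng", tessdata_path=None):
    """Run OCR on one image after forcing the "C" locale.

    tesserocr and Pillow are imported lazily so the locale is already set
    when the library is loaded (an assumption, see the note above).
    """
    force_c_locale()
    import tesserocr
    from PIL import Image

    kwargs = {"lang": lang}
    if tessdata_path is not None:
        # Same 'path' argument that sirfz suggested passing to PyTessBaseAPI.
        kwargs["path"] = tessdata_path
    with tesserocr.PyTessBaseAPI(**kwargs) as api:
        api.SetImage(Image.open(image_path))
        return api.GetUTF8Text()
```

A call such as `ocr_image('test.jpg', lang='kor', tessdata_path='/usr/local/Cellar/tesseract/3.05.01/share/tessdata/')` would then mirror the failing test script from earlier in the thread, with the locale change applied first.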

Thank you, I've solved this problem the hard way: switch the working directory to C:\Program Files\Python36\Lib\site-packages\tesserocr.

import tesserocr
import os
from PIL import Image

os.chdir(r"C:\Program Files\Python36\Lib\site-packages\tesserocr")
image = Image.open('image.png')
print(tesserocr.image_to_text(image))

The downside is that you have to switch the directory to tesserocr every time.

HYSWZW avatar May 16 '18 02:05 HYSWZW
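
If the directory switch in HYSWZW's workaround is needed, it can at least be confined to the OCR call so the rest of the program keeps its original working directory. This is a sketch using only the standard library; `working_directory` is a hypothetical helper name, and whether the directory change is required at all on a given setup is not established here.

```python
import contextlib
import os


@contextlib.contextmanager
def working_directory(path):
    """Temporarily change the current working directory, restoring it on exit."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Restore the original directory even if the body raised an exception.
        os.chdir(previous)
```

With this helper, the image can be opened before entering the block (or via an absolute path), and only the `tesserocr.image_to_text(image)` call needs to sit inside `with working_directory(...)`. Passing the tessdata directory through the `path` argument, as sirfz suggested above, may avoid the directory switch altogether.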

@kenkuang @HYSWZW Thank you for sharing the solutions you used. I will try them as well. :)

Gatsby-Lee avatar May 20 '18 22:05 Gatsby-Lee

Hi, is it solved?

Mrh08512 avatar Jun 22 '18 06:06 Mrh08512

@kenkuang Very nice, Ken! It works, and I fixed this problem!

Mrh08512 avatar Jun 22 '18 06:06 Mrh08512

Hi @Gatsby-Lee @HYSWZW, I tried what you described above, but I got an error. Can you help me?

  • OS
win10
  • PIP list
tesserocr 2.4.0
Pillow 5.4.1
  • code
import tesserocr
from PIL import Image
import os

print(tesserocr.tesseract_version())
print(tesserocr.get_languages())
os.chdir(r"D:\Programs\Python36\Lib\site-packages\tesserocr")
image = Image.open('code2.jpg')
result = tesserocr.image_to_text(image)
print(result)
  • output
tesseract 4.0.0
 leptonica-1.76.0 (Jan 8 2019, 13:41:57) [MSC v.1900 LIB Release x64]
  libgif 5.1.4 : libjpeg 9b : libpng 1.6.35 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
('D:\Programs\Python36\/tessdata/', [])
Traceback (most recent call last):
  File "D:/jupyternotebook/Cuiqingcai_scraw/python3_scraw/eightchapter/pic.py", line 9, in <module>
    result = tesserocr.image_to_text(image)
  File "tesserocr.pyx", line 2443, in tesserocr._tesserocr.image_to_text
RuntimeError: Failed to init API, possibly an invalid tessdata path: D:\Programs\Python36/tessdata/

qiangyu1990 avatar Mar 31 '19 11:03 qiangyu1990

Same error...

zengzhanhang avatar Jan 02 '20 12:01 zengzhanhang