ruby-tesseract-ocr
Can't load OSD data
Hey meh,
Just looking for some quick advice. I've managed to get ruby-tesseract-ocr working with page_segmentation_mode 1 on Ubuntu (12.04) and the OSD trained data.
I'm having trouble doing the same on OS X (Mavericks), unfortunately. I've got tesseract installed via Homebrew, and although I can use the default tesseract CLI to extract text using the OSD data, I can't manage the same using ruby-tesseract-ocr. The tesseract CLI has a --list-langs option, which displays "osd" as one of the available languages.
Despite that, this keeps happening:
2.1.0 :010 > tesseract = Tesseract::Engine.new{ |e|
2.1.0 :011 > e.language = LANGUAGE
2.1.0 :012?> e.page_segmentation_mode = 1
2.1.0 :013?> }
=> #<Tesseract::Engine:0x00000101aed998 @api=#<Tesseract::API:0x00000101aed830 @internal=#<FFI::AutoPointer address=0x00000102dae230>>, @initializing=false, @init=#<Proc:0x00000101aed948@(irb):10>, @path=".", @language=:ukr, @mode=:DEFAULT, @variables={}, @config=[], @rectangle=[], @psm=1>
2.1.0 :014 > blocks = tesseract.blocks_for(sideways)
Failed loading language 'osd'
Tesseract couldn't load any languages!
Warning: Auto orientation and script detection requested, but osd language failed to load
=> [#<Tesseract::Block(61.34318161010742): "шзешшцш:\n\n">, #<Tesseract::Block(63.370262145996094): "шшёо\n\n">, #<Tesseract::Block(60.00260543823242): "...Е .пьшцс ю\xD1сюоцюм Еьозцоьао\xD1\n\n">, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(62.234920501708984): "„м8\n\nюоы .ЦЕнЦФЕ „лноіг\xD1ёоцгьь юшозаоьаои\nщит: .шЕ дозцоьао.. о: оьодош\nтої ...Еь .шьшцс юЕБоцшм штбзцоьао:\n\nтот ...чоьнчгп „льоіьъдёоцъць шшозцоьао:\n\nыююю ...гг . .дюоц \xD1:..==ш...:оЕ льоїашш мшозцоьао: оьолош\nтм. .ІЕ „\xD1шьтаьзш \xD1зтз\xD1юоцзшо\xD1дцшьшм ш шьшц: ш\xD1ьюоцшм\n\nю .цоькцец .\xD1юьшаьзш \xD1зтз\xD1юоцзщошдцшьшм ш льосЕдёоцёць\n\nїю .ІВ „зьтцьзш _т::юоц:шо_._д:ш._шю\n\nї З ь ч _ ю ы г\n\n">, #<Tesseract::Block(48.451499938964844): "ттылчоцлїцлёщ\n\n">, #<Tesseract::Block(49.25178146362305): "3.93 ю-у_ш< о\xD1шцсёо\xD1 :::ёшцьоцсш\n\n">, #<Tesseract::Block(0.0): nil>]
Do you have any advice as to whether I'm missing a config setting somewhere? I'm mostly perplexed because, as far as I can tell, the data is in the right place and everything else works (no compile errors or anything either).
This is a duplicate of #23, but not having an OS X system prevents me from doing any debugging in regard to that.
Honestly, I think it's an issue with how tesseract-ocr is compiled on OS X, since the library doesn't export anything to define load paths, from what I recall.
In short, the only advice I have is to look carefully at the configure options when building tesseract-ocr and hope for the best.
Alrighty, thanks @meh, I'll try to take a poke around Homebrew's tesseract recipe. The thing that I don't quite get is how the PSM settings look for osd.traineddata in a manner that's different from the main mechanism for loading language training data (since I'm OCRing non-English documents just fine).
If you look at the examples/nerdz-captcha-breaker/break.rb source, it doesn't do any path fiddling; it basically just looks for tessdata in the same directory the script is run from.
This means the load paths for language files are a compile time option.
EDIT: wait, it actually does a Tesseract.prefix = './', guess that's what should be done if you have your language files in different directories from the standard ones.
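For anyone wanting to check where the language files actually live before setting the prefix, here is a rough sketch in plain Ruby. It is only an illustration: `find_tessdata_prefix` is a hypothetical helper (not part of the gem), the candidate directories are guesses about common install locations, and it assumes the prefix is the directory that contains the `tessdata` folder.

```ruby
# Hypothetical helper, not part of ruby-tesseract-ocr: return the first
# directory whose tessdata/ subfolder contains <language>.traineddata,
# or nil if none of the candidates does.
def find_tessdata_prefix(candidates, language = 'osd')
  candidates.compact.find do |dir|
    File.file?(File.join(dir, 'tessdata', "#{language}.traineddata"))
  end
end

# These candidate locations are assumptions; adjust them to your install.
CANDIDATE_PREFIXES = [
  ENV['TESSDATA_PREFIX'],        # honoured by the tesseract CLI when set
  '/usr/local/share',            # assumed Homebrew-style location
  '/usr/share/tesseract-ocr',    # assumed Ubuntu-style location
  '.'                            # tessdata next to the script
].freeze

prefix = find_tessdata_prefix(CANDIDATE_PREFIXES)
if prefix
  puts "osd.traineddata found under #{File.join(prefix, 'tessdata')}"
else
  puts 'osd.traineddata not found in any candidate directory'
end

# If a prefix is found, it could then be passed to the gem the same way
# break.rb does (the trailing slash mirrors the './' used there):
#   require 'tesseract'
#   Tesseract.prefix = "#{prefix}/"
```

If this reports a directory that differs from where the gem is looking, setting the prefix explicitly before creating the engine may be worth a try; whether that also fixes the OSD loading on OS X is untested here.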
@knowtheory curious if you had any luck digging into the OS X related issues? It would be nice to play with the other Tesseract configs, like segmentation mode and custom configs, but I get the same errors mentioned above. Not sure I have the experience to help debug, but I'll probably give it a shot if I have time.
@bwinterling unfortunately, no, I haven't had time to dig in :(