ruby-tesseract-ocr

Can't load OSD data

Open knowtheory opened this issue 11 years ago • 5 comments

Hey meh,

Just looking for some quick advice. I've managed to get ruby-tesseract-ocr working with page_segmentation_mode 1 on Ubuntu (12.04) and the OSD trained data.

I'm having trouble doing the same on OS X (Mavericks), unfortunately. I've got tesseract installed via Homebrew, and although I can use the stock tesseract CLI to extract text using the OSD data, I can't manage the same using ruby-tesseract-ocr. The tesseract CLI has a --list-langs option, which displays "osd" as one of the available languages.

Despite that, this keeps happening:

2.1.0 :010 >   tesseract = Tesseract::Engine.new{ |e| 
2.1.0 :011 >       e.language               = LANGUAGE
2.1.0 :012?>     e.page_segmentation_mode = 1
2.1.0 :013?>   }
 => #<Tesseract::Engine:0x00000101aed998 @api=#<Tesseract::API:0x00000101aed830 @internal=#<FFI::AutoPointer address=0x00000102dae230>>, @initializing=false, @init=#<Proc:0x00000101aed948@(irb):10>, @path=".", @language=:ukr, @mode=:DEFAULT, @variables={}, @config=[], @rectangle=[], @psm=1> 
2.1.0 :014 > blocks = tesseract.blocks_for(sideways)
Failed loading language 'osd'
Tesseract couldn't load any languages!
Warning: Auto orientation and script detection requested, but osd language failed to load
 => [#<Tesseract::Block(61.34318161010742): "шзешшцш:\n\n">, #<Tesseract::Block(63.370262145996094): "шшёо\n\n">, #<Tesseract::Block(60.00260543823242): "...Е .пьшцс ю\xD1сюоцюм Еьозцоьао\xD1\n\n">, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(0.0): nil>, #<Tesseract::Block(62.234920501708984): "„м8\n\nюоы .ЦЕнЦФЕ „лноіг\xD1ёоцгьь юшозаоьаои\nщит: .шЕ дозцоьао.. о: оьодош\nтої ...Еь .шьшцс юЕБоцшм штбзцоьао:\n\nтот ...чоьнчгп „льоіьъдёоцъць шшозцоьао:\n\nыююю ...гг . .дюоц \xD1:..==ш...:оЕ льоїашш мшозцоьао: оьолош\nтм. .ІЕ „\xD1шьтаьзш \xD1зтз\xD1юоцзшо\xD1дцшьшм ш шьшц: ш\xD1ьюоцшм\n\nю .цоькцец .\xD1юьшаьзш \xD1зтз\xD1юоцзщошдцшьшм ш льосЕдёоцёць\n\nїю .ІВ „зьтцьзш _т::юоц:шо_._д:ш._шю\n\nї З ь ч _ ю ы г\n\n">, #<Tesseract::Block(48.451499938964844): "ттылчоцлїцлёщ\n\n">, #<Tesseract::Block(49.25178146362305): "3.93 ю-у_ш< о\xD1шцсёо\xD1 :::ёшцьоцсш\n\n">, #<Tesseract::Block(0.0): nil>] 

Do you have any advice as to whether I'm missing a config setting somewhere? I'm mostly perplexed because, as far as I can tell, the data is in the right place and everything else works (no compile errors or anything either).

knowtheory, Mar 07 '14 16:03

This is a duplicate of #23, but since I don't have an OS X system I can't do any debugging on it.

Honestly, I think it's an issue with how tesseract-ocr is compiled on OS X, since, from what I recall, the library doesn't export anything for defining load paths.

In short, the only advice I have is to look carefully at the configure options when building tesseract-ocr and hope for the best.

meh, Mar 07 '14 16:03

Alrighty, thanks @meh, I'll try to take a poke around Homebrew's tesseract recipe. The thing that I don't quite get is how the PSM settings look for osd.traineddata in a manner that's different from the main mechanism for loading language training data (since I'm OCRing non-English documents just fine).

knowtheory, Mar 07 '14 16:03

If you look at the examples/nerdz-captcha-breaker/break.rb source, it doesn't do any path fiddling; it basically just looks for tessdata in the same directory the script is run from.

This means the load paths for language files are a compile-time option.

EDIT: wait, it actually does a Tesseract.prefix = './'. I guess that's what should be done if your language files are in a directory other than the standard ones.

meh, Mar 07 '14 16:03
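
To check whether the trained data really is where the binding will look, something like the following may help. It is only a sketch: `find_tessdata_prefix` is a hypothetical helper (not part of ruby-tesseract-ocr), the candidate directories are assumptions about typical installs, and the `Tesseract.prefix =` setter is taken from the comment above rather than verified against the gem. It uses only the Ruby standard library to find a directory whose tessdata/ subdirectory holds every requested language file, including osd.

```ruby
# Hypothetical helper (not part of ruby-tesseract-ocr): return the first
# directory in `candidates` whose tessdata/ subdirectory contains a
# <lang>.traineddata file for every requested language, or nil if none does.
def find_tessdata_prefix(candidates, langs)
  candidates.compact.find do |dir|
    langs.all? do |lang|
      File.file?(File.join(dir, 'tessdata', "#{lang}.traineddata"))
    end
  end
end

# Assumed search locations; adjust these for your own install.
candidates = [
  ENV['TESSDATA_PREFIX'],      # environment override, if set
  '.',                         # a tessdata/ directory next to the script
  '/usr/local/share',          # assumed Homebrew-style location
  '/usr/share/tesseract-ocr',  # assumed Ubuntu package location
]

# 'ukr' matches the language shown in the engine inspect output above.
prefix = find_tessdata_prefix(candidates, %w[osd ukr])

if prefix
  puts "Found osd and ukr trained data under #{File.join(prefix, 'tessdata')}"
  # Setter as mentioned in the thread; unverified here, so left commented out:
  # Tesseract.prefix = prefix
else
  puts 'No candidate directory contains both osd.traineddata and ukr.traineddata'
end
```

If no directory is reported, that would suggest the CLI and the binding are resolving tessdata from different places, which is consistent with the "Failed loading language 'osd'" message.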

@knowtheory, curious whether you had any luck digging into the OS X related issues? It would be nice to play with the other Tesseract configs, like segmentation mode and custom configs, but I get the same errors mentioned above. I'm not sure I have the experience to help debug, but I'll probably give it a shot if I have time.

bwinterling, Apr 24 '14 18:04

@bwinterling unfortunately, no, I haven't had time to dig in :(

knowtheory, Apr 24 '14 18:04