
convert eng training to h5 model

Open ehrenmann1977 opened this issue 3 years ago • 3 comments

How can I export a Keras model for the English language? Is it possible to export the training corpus so that I can use it for neural network training, something like the MNIST dataset?

ehrenmann1977, Apr 02 '22 08:04

Good question. Tesseract uses its own model file format. But it should be possible to convert the included neural network to any other model format which supports the same network specification.

We still have to find someone who wants to implement that (and also the other direction).

stweil, May 17 '23 18:05

Is there any documentation available on the model file format Tesseract uses (*.traineddata file format specification)?

stefan6419846, May 19 '23 12:05

The command line tool combine_tessdata can list and extract all components of a model file:

% combine_tessdata -d /opt/homebrew/share/tessdata/eng.traineddata 
Version:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
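
The sizes and offsets in this listing are enough to slice a single component out of the file by hand. Below is a minimal Python sketch, assuming the listing keeps the line format printed above. The names `parse_listing` and `extract_component` are made up for illustration and are not part of Tesseract.

```python
import re
from typing import List, NamedTuple


class Component(NamedTuple):
    index: int
    name: str
    size: int
    offset: int


# Matches lines such as "17:lstm:size=401636, offset=192".
_LINE = re.compile(r"^(\d+):([\w-]+):size=(\d+), offset=(\d+)$")


def parse_listing(text: str) -> List[Component]:
    """Parse the component lines printed by `combine_tessdata -d`."""
    components = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match:  # the "Version:..." line does not match and is skipped
            index, name, size, offset = match.groups()
            components.append(Component(int(index), name, int(size), int(offset)))
    return components


def extract_component(path: str, component: Component) -> bytes:
    """Read the raw bytes of one component from a .traineddata file."""
    with open(path, "rb") as handle:
        handle.seek(component.offset)
        return handle.read(component.size)


# The listing from above, copied verbatim.
LISTING = """\
Version:4.00.00alpha:eng:synth20170629
17:lstm:size=401636, offset=192
18:lstm-punc-dawg:size=4322, offset=401828
19:lstm-word-dawg:size=3694794, offset=406150
20:lstm-number-dawg:size=4738, offset=4100944
21:lstm-unicharset:size=6360, offset=4105682
22:lstm-recoder:size=1012, offset=4112042
23:version:size=30, offset=4113054
"""

if __name__ == "__main__":
    for component in parse_listing(LISTING):
        print(component)
```

In this particular listing each component starts exactly where the previous one ends (offset plus size equals the next offset), so the payload appears to be a plain concatenation after a 192-byte header. The extracted lstm bytes are still in Tesseract's own serialization, so this only isolates the data and does not decode it.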

Another tool, dawg2wordlist, can convert the dawg components to plain text files, and the unicharset is already text. That's the easy part.
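
As a rough sketch of that easy part, the following Python helper builds and runs the two commands: `combine_tessdata -u` to unpack the model into separate files, then `dawg2wordlist` to dump the word dawg as text. It assumes both tools are on the PATH and that the unpacked files are named after the components in the listing above. The helper names and the output file name `lstm-wordlist.txt` are chosen here for illustration only.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_commands(traineddata: str, out_dir: str) -> List[List[str]]:
    """Return the commands that unpack a model and dump its word list."""
    lang = Path(traineddata).name.split(".")[0]   # "eng" for eng.traineddata
    # combine_tessdata appends each component name to this prefix.
    prefix = str(Path(out_dir) / (lang + "."))
    return [
        ["combine_tessdata", "-u", traineddata, prefix],
        # Argument order: unicharset, dawg, output word list.
        ["dawg2wordlist",
         prefix + "lstm-unicharset",
         prefix + "lstm-word-dawg",
         prefix + "lstm-wordlist.txt"],  # hypothetical output name
    ]


def run_commands(commands: List[List[str]]) -> None:
    """Run each command in order, failing early if a tool is missing."""
    for command in commands:
        if shutil.which(command[0]) is None:
            raise FileNotFoundError(command[0] + " not found on PATH")
        subprocess.run(command, check=True)


if __name__ == "__main__":
    commands = build_commands("/opt/homebrew/share/tessdata/eng.traineddata", "out")
    Path("out").mkdir(exist_ok=True)
    run_commands(commands)
```

The same pattern should work for the punc and number dawgs by swapping the dawg file name.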

The interesting part is the lstm component, which contains the neural network. Its format is not documented, so the program code is the only reference for it. Look for DeSerialize in the lstm code.

stweil, May 19 '23 16:05