Deserialize header failed: lstmf files do not work across machines with different endianness

Open Shreeshrii opened this issue 4 years ago • 20 comments

The unit tests need some lstmf files from the repo https://github.com/tesseract-ocr/test/tree/master/testdata, which is used as a submodule in tesseract.

The unit tests run OK on ppc64le (little endian) but fail with the following error on ppc64 (big endian).

[ RUN      ] LSTMTrainerTest.EncodesEng
Config file is optional, continuing...
Failed to read data from: /home/shreeshrii/langdata_lstm//eng/eng.config
Warning: given outputs 1 not equal to unicharset of 112.
Num outputs,weights in Series:
  1,1,0,32:32, 0
  Lbx100:200, 106400
  Fc112:112, 22512
Total weights = 128912
Built network:[1,1,0,32Lbx100Fc112] from request [1,1,0,32 Lbx100 O1c1]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.01, momentum=0.9
null char=2
Deserialize header failed: /home/shreeshrii/tesseract/test/testdata/eng.Arial_Unicode_MS.exp0.lstmf
First document cannot be empty!!
num_pages_per_doc_ > 0:Error:Assert failed:in file ../../../src/ccstruct/imagedata.cpp, line 651
FAIL lstmtrainer_test (exit status: 133)
Running main() from ../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from LSTMTrainerTest
[ RUN      ] LSTMTrainerTest.RecodeTestKorBase
Config file is optional, continuing...
Warning: given outputs 1 not equal to unicharset of 836.
Num outputs,weights in Series:
  1,1,0,32:32, 0
  Lbx96:192, 99072
  Fc836:836, 161348
Total weights = 260420
Built network:[1,1,0,32Lbx96Fc836] from request [1,1,0,32 Lbx96 O1c1]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.01, momentum=0.9
null char=2
Deserialize header failed: /home/shreeshrii/tesseract/test/testdata/kor.Arial_Unicode_MS.exp0.lstmf
First document cannot be empty!!
num_pages_per_doc_ > 0:Error:Assert failed:in file ../../../src/ccstruct/imagedata.cpp, line 651
FAIL lstm_recode_test (exit status: 133)
uname -a

Linux rh-power-vm61.fit.vutbr.cz 4.16.3-301.fc28.ppc64 #1 SMP Mon Apr 23 21:44:46 UTC 2018 ppc64 ppc64 ppc64 GNU/Linux

tesseract -v

tesseract 5.0.0-alpha-319-g8dc3
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2

Shreeshrii avatar Jul 24 '19 04:07 Shreeshrii

They run ok on a different machine.

uname -a

Linux tesseract-ocr 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:54:50 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux

tesseract -v

tesseract 5.0.0-alpha-322-g74ac
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
Running main() from ../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from LSTMTrainerTest
[ RUN      ] LSTMTrainerTest.RecodeTestKorBase
Config file is optional, continuing...
Warning: given outputs 1 not equal to unicharset of 836.
Num outputs,weights in Series:
  1,1,0,32:32, 0
  Lbx96:192, 99072
  Fc836:836, 161348
Total weights = 260420
Built network:[1,1,0,32Lbx96Fc836] from request [1,1,0,32 Lbx96 O1c1]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.01, momentum=0.9
null char=2
Loaded 464/464 lines (1-464) of document /home/ubuntu/tesseract/test/testdata/kor.Arial_Unicode_MS.exp0.lstmf
Running main() from ../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from LSTMTrainerTest
[ RUN      ] LSTMTrainerTest.EncodesEng
Config file is optional, continuing...
Failed to read data from: /home/ubuntu/langdata_lstm//eng/eng.config
Warning: given outputs 1 not equal to unicharset of 112.
Num outputs,weights in Series:
  1,1,0,32:32, 0
  Lbx100:200, 106400
  Fc112:112, 22512
Total weights = 128912
Built network:[1,1,0,32Lbx100Fc112] from request [1,1,0,32 Lbx100 O1c1]
Training parameters:
  Debug interval = 0, weights = 0.1, learning rate = 0.01, momentum=0.9
null char=2
Loaded 929/929 lines (1-929) of document /home/ubuntu/tesseract/test/testdata/eng.Arial_Unicode_MS.exp0.lstmf
Config file is optional, continuing...
Failed to read data from: /home/ubuntu/langdata_lstm//eng/eng.config
Null char=2
Warning: given outputs 1 not equal to unicharset of 111.
Num outputs,weights in Series:
  1,1,0,32:32, 0
  Lbx100:200, 106400
  Fc111:111, 22311

...

Shreeshrii avatar Jul 24 '19 04:07 Shreeshrii

The file sizes on both machines are the same.

ls -l /home/shreeshrii/tesseract/test/testdata/*.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii  737425 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/deu.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii 1680435 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/eng.Arial.exp0.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii 1511827 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/eng.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii  756028 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/fra.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii  866492 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/kan.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii  979338 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/kor.Arial_Unicode_MS.exp0.lstmf
 ls -l  /home/ubuntu/tesseract/test/testdata/*.lstmf
-rw-rw-r-- 1 ubuntu ubuntu  737425 Jul  9 09:57 /home/ubuntu/tesseract/test/testdata/deu.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r-- 1 ubuntu ubuntu 1680435 Jul  9 09:57 /home/ubuntu/tesseract/test/testdata/eng.Arial.exp0.lstmf
-rw-rw-r-- 1 ubuntu ubuntu 1511827 Jul  9 09:57 /home/ubuntu/tesseract/test/testdata/eng.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r-- 1 ubuntu ubuntu  756028 Jul  9 09:57 /home/ubuntu/tesseract/test/testdata/fra.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r-- 1 ubuntu ubuntu  866492 Jul  9 09:57 /home/ubuntu/tesseract/test/testdata/kan.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r-- 1 ubuntu ubuntu  979338 Jul  9 09:57 /home/ubuntu/tesseract/test/testdata/kor.Arial_Unicode_MS.exp0.lstmf
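
Matching sizes suggest, but do not prove, that the files are byte-identical on both machines. A minimal sketch for confirming it, assuming Python 3 is available on both hosts: compute a SHA-256 digest per file and compare the results. The helper names `sha256_of` and `digests_in` are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digests_in(directory):
    """Map each *.lstmf file name in a directory to its digest."""
    return {p.name: sha256_of(p) for p in sorted(Path(directory).glob("*.lstmf"))}
```

Calling `digests_in(...)` on the testdata directory of each machine and comparing the two dictionaries would show whether the bytes differ. If they match, the failure comes from how the bytes are interpreted, not from the files themselves.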

Shreeshrii avatar Jul 24 '19 04:07 Shreeshrii

lstmtraining --model_output="C:\Users\zhangtiehai\Desktop\new_6\output" --continue_from jpn.lstm --train_listfile=text.training_file.txt --traineddata="C:\Users\zhangtiehai\Desktop\new_6\jpn.traineddata" -max_iterations 800 -U text.unicharset
Loaded file jpn.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from jpn.lstm
Deserialize header failed: !'+,-01456789:?BCEHLSTUacdefilmnoprstuy[long run of mis-encoded characters, shown as "?" and "€"]ey
Load of page 0 failed!
Load of images failed!!

Excuse me, why does this happen and how can I solve the problem?

zthdsb avatar Jul 30 '19 08:07 zthdsb

Please share the files you used for testing. Thanks!

Shreeshrii avatar Jul 30 '19 08:07 Shreeshrii

!'+,-01456789:?BCEHLSTUacdefilmnoprstuy―…→、。々〇《》「」『』【】ぁあぃいうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろわをん゛ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチッツテデトドナニネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロワンヴヶー一丁七万丈三上下不与世丘両並中串丸丹主久之乏乖乗乙九乞也乱乳乾亀了予争事二云互五井些亡亢交享京亭人仁仇今介仏仔仕他付代令以仮仰仲件任企伊伍伏伐休会伝伯伴伸伺似伽佇位低住佐体何余作佳併使例侍供依価侮侵侶便係促俊俗保信俤修俯俺倅倉個倍倒倖候借倣値倦偉偏停健側偵偶偽傀傍傘備催傲傷傾僅働像僕僚僥僧僭僻儀億儘儚償儡優儲元兄充兆先光克免兎児党入全八公六共兵其具典兼内円冊再冒冗写冠冥冬冴冷凄凋凍凝几凡処凶凸凹出函刀刃分切刈刊刑列初判別利到刳制刷券刹刺刻則削剌前剖剛剣剤剥副剰割創劇力功加劣助努劫励労効劾勇勉動勘務勝募勢勤勧勾勿匂包化北匹区医匿十千升午半卑卒卓協南単博占印危即却卵卸厄厚原厠厨厭厳去参叉及友双反収叔取受叙叟叢口古句叩只叫召可台叱史右叶号司各合吉吊同名吐向君吝吟吠否含吸吹吼吾呂呆呈告呑呟周呪味呵呷呻呼命咄和咎咤咬咲咳咽哀品哄哉員哭哲唆唇唐唖唯唱唸唾商問啓啖啜啣啼喀善喉喋喘喚喜喝喧喩喪喫喰営嗄嗅嗇嗚嗜嗟嗤嘆嘔嘖嘘嘩嘲嘴噂噌噛噤器噪噴嚇嚥囁囃囓囚四回因団囮困囲図固国圃圏園土圧在地坂均坊坐坦垂型垢垣埃埋城埒埜域執培基堀堂堅堆堕堙堡堪堰報場堵堺塀塊塔塗塚塞塩填塵塹塾境墓増墜墟墨壁壇壊壌壕士壮声売壺変夏夕外多夜夢大天太夫央失奇奈奉奏契奔套奢奥奪奮女奴好如妄妙妥妨妬妹妻姉始姑姓委姪姻姿威娘娯娶婆婚婦婿媒媚嫁嫉嫌嬉嬢子孔孕字存孝季孤学孫宅守安完宗官宙定宛宜宝実客宣室宥宮害宴宵家容宿寂寄密富寒寛寝寞察寡寧審寮寵寸寺対寿封専射将尊尋導小少尖就尻尼尽尾尿局屁居屈届屋屍屑屓展属屠層履山屹岐岡岩岳岸峙峠島峻崇崎崖崩嵌嵩嶋巌川州巡巣工左巨巫差己巳巻巾市布希帝師席帯帰帳帷常帽幄幅幕幡幢幣干平年幸幹幻幼幽幾広庇床序底店庚度座庫庭庶康庸廂廃廊延建廻弁弄弊式弓弔引弘弛弟弥弧弱張強弾当形彩彫影彷役彼往径待徊律後徐徒従得徘御徨復循微徳徴徹心必忌忍忖志忘忙応忠快念忽忿怒怖思怠急性怨怪怯怺恃恋恐恒恣恥恨恩恫恭息恰恵悄悍悔悖悟悠患悦悩悪悲悴悶悸悼情惑惚惜惧惨惰想惹愁愉愍意愕愚愛感愧愴慄慇慈態慌慎慕慚慟慢慣慨慮慰慴憂憊憎憐憑憔憚憤憧憩憫憮憶憺懃懇懊懐懣懦懲懸成我戒戚戦截戮戯戸戻房所扁扇扈扉手

zthdsb avatar Jul 30 '19 08:07 zthdsb

train_listfile=text.training_file.txt

This needs to be a text file with a list of lstmf files which are used for training.

It seems to me that you have put a list of all characters in it. That text is not suitable even as a training text to be used for generating the lstmf files.

Please see the page regarding training in the wiki.

Shreeshrii avatar Jul 30 '19 08:07 Shreeshrii

Could you give me an example of a testing file? Thanks very much.

zthdsb avatar Jul 30 '19 08:07 zthdsb

See https://github.com/tesseract-ocr/langdata/blob/master/jpn/jpn.training_text for a sample training text.

See https://github.com/Shreeshrii/tess4tutorial/blob/master/trainlayer/eng.training_files.txt for a sample of the file to be given for --train_listfile.

Shreeshrii avatar Jul 30 '19 08:07 Shreeshrii

How do I create the .lstmf file?

zthdsb avatar Jul 30 '19 08:07 zthdsb

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#using-tesstrainsh

OMP_THREAD_LIMIT=1 tesseract $my_file ${my_file%.*} -l jpn --psm 6 lstm.train

Shreeshrii avatar Jul 30 '19 09:07 Shreeshrii

e3 ffffff83 ffffff96 ffffffe3 ffffff83 ffffff97 ffffffe3 ffffff83 ffffff98 ffffffe3 ffffff83 ffffff99 ffffffe3 ffffff83 ffffff9a ffffffe3 ffffff83 ffffff9b ffffffe3 ffffff83 ffffff9c ffffffe3 ffffff83 ffffff9d ffffffe3 ffffff83 ffffff9e ffffffe3 ffffff83 ffffff9f ffffffe3 ffffff83 ffffffa0 ffffffe3 ffffff83 ffffffa1 ffffffe3 ffffff83 ffffffa2 ffffffe3 ffffff83 ffffffa3 ffffffe3 ffffff83 ffffffa4 ffffffe3 ffffff83 ffffffa5 ffffffe3 ffffff83 ffffffa6 ffffffe3 ffffff83 ffffffa7 ffffffe3 ffffff83 ffffffa8 ffffffe3 ffffff83 ffffffa9 ffffffe3 ffffff83 ffffffaa ffffffe3 ffffff83 ffffffab ffffffe3 ffffff83 ffffffac ffffffe3 ffffff83 ffffffad ffffffe3 ffffff83 ffffffaf ffffffe3 ffffff83 ffffffb3 ffffffe3 ffffff83 ffffffb4 ffffffe3 ffffff83 ffffffb6 ffffffe3 ffffff83 ffffffbc ffffffe4 ffffffb8 ffffff80 ffffffe4 ffffffb8 ffffff81 ffffffe4 ffffffb8 ffffff83 ffffffe4 ffffffb8 ffffff87 ffffffe4 ffffffb8 ffffff88 ffffffe4 ffffffb8 ffffff89 ffffffe4 ffffffb8 ffffff8a ffffffe4 ffffffb8 ffffff8b ffffffe4 ffffffb8 ffffff8d ffffffe4 ffffffb8 ffffff8e ffffffe4 ffffffb8 ffffff96 ffffffe4 ffffffb8 ffffff98 ffffffe4 ffffffb8 ffffffa1 ffffffe4 ffffffb8 ffffffa6 ffffffe4 ffffffb8 ffffffad ffffffe4 ffffffb8 ffffffb2 ffffffe4 ffffffb8 ffffffb8 ffffffe4 ffffffb8 ffffffb9 ffffffe4 ffffffb8 ffffffbb ffffffe4 ffffffb9 ffffff85 ffffffe4 ffffffb9 ffffff8b ffffffe4 ffffffb9 ffffff8f ffffffe4 ffffffb9 ffffff96 ffffffe4 ffffffb9 ffffff97 ffffffe4 ffffffb9 ffffff99 ffffffe4 ffffffb9 ffffff9d ffffffe4 ffffffb9 ffffff9e ffffffe4 ffffffb9 ffffff9f ffffffe4 ffffffb9 ffffffb1 ffffffe4 ffffffb9 ffffffb3 ffffffe4 ffffffb9 ffffffbe ffffffe4 ffffffba ffffff80 ffffffe4 ffffffba ffffff86 ffffffe4 ffffffba ffffff88 ffffffe4 ffffffba ffffff89 ffffffe4 ffffffba ffffff8b ffffffe4 ffffffba ffffff8c ffffffe4 ffffffba ffffff91 ffffffe4 ffffffba ffffff92 ffffffe4 ffffffba ffffff94 ffffffe4 ffffffba ffffff95 ffffffe4 ffffffba ffffff9b ffffffe4 ffffffba ffffffa1 ffffffe4 ffffffba ffffffa2 ffffffe4 ffffffba ffffffa4 
ffffffe4 ffffffba ffffffab ffffffe4 ffffffba ffffffac ffffffe4 ffffffba ffffffad ffffffe4 ffffffba ffffffba ffffffe4 ffffffbb ffffff81 ffffffe4 ffffffbb ffffff87 ffffffe4 ffffffbb ffffff8a ffffffe4 ffffffbb ffffff8b ffffffe4 ffffffbb ffffff8f ffffffe4 ffffffbb ffffff94 ffffffe4 ffffffbb ffffff95 ffffffe4 ffffffbb

Excuse me, the screen shows the output above. Is it training?

zthdsb avatar Jul 30 '19 09:07 zthdsb

Excuse me: the file "text.font.exp0.lstmf" was generated by the command:
tesseract text.font.exp0.tif text.font.exp0 -l jpn lstm.train
The content of text.training_file.txt is "../new_6/text.font.exp0.lstmf".
Then the command is:
lstmtraining --model_output="D:\new_6\output" --continue_from jpn.lstm --train_listfile=text.training_file.txt --traineddata="D:\new_6\jpn.traineddata" -max_iterations 5

Is that right? What encoding should text.training_file.txt use?

zthdsb avatar Jul 30 '19 09:07 zthdsb

ls -1 ../new_6/text.font.exp0.lstmf > text.training_file.txt

should create the file in the correct format. I think it is a UTF-8 file with Unix EOL and a blank line at the end.
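
Where `ls` is not available (for example in a plain Windows shell), the same kind of list file could be produced with a short script. This is only a sketch under the assumptions stated above (UTF-8, Unix line endings, one path per line, newline after the last entry); the function name `write_listfile` is made up for illustration.

```python
from pathlib import Path


def write_listfile(lstmf_dir, listfile):
    """Write one absolute .lstmf path per line, UTF-8, Unix line endings."""
    paths = sorted(Path(lstmf_dir).resolve().glob("*.lstmf"))
    # newline="\n" stops Python from translating "\n" to "\r\n" on Windows.
    with open(listfile, "w", encoding="utf-8", newline="\n") as f:
        for p in paths:
            # as_posix() writes forward slashes, e.g. D:/new_6/x.lstmf.
            f.write(p.as_posix() + "\n")
    return paths
```

For example, `write_listfile("../new_6", "text.training_file.txt")` would list every .lstmf file found in that directory.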

Shreeshrii avatar Jul 30 '19 09:07 Shreeshrii

lstmtraining --model_output="D:\new_6\output" --continue_from jpn.lstm --train_listfile=text.training_file.txt --traineddata="D:\new_6\jpn.traineddata" -max_iterations 5
Loaded file jpn.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from jpn.lstm
Deserialize header failed: ../new_6/text.font.exp0.lstmf
Load of page 0 failed!
Load of images failed!!

zthdsb avatar Jul 30 '19 09:07 zthdsb

Please check the location of ../new_6/text.font.exp0.lstmf since you have given a relative path.

Try with complete absolute path of your lstmf file.
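
A quick way to rule out path problems before running lstmtraining is to check every entry in the list file. The sketch below assumes that relative entries are resolved against the current working directory of the process, which is an assumption about lstmtraining and not something confirmed in this thread; `check_listfile` is a hypothetical helper name.

```python
from pathlib import Path


def check_listfile(listfile):
    """Return (entry, problem) pairs for entries that look wrong."""
    problems = []
    base = Path.cwd()
    text = Path(listfile).read_text(encoding="utf-8")
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # skip blank lines, such as a trailing one
        path = Path(entry)
        if not path.is_absolute():
            path = base / path
        if not path.is_file():
            problems.append((entry, "file not found"))
        elif path.stat().st_size == 0:
            problems.append((entry, "file is empty"))
    return problems
```

An empty result only means the files exist and are not empty. A file that passes this check can still fail with "Deserialize header failed" if its content is not what lstmtraining expects.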

Shreeshrii avatar Jul 30 '19 10:07 Shreeshrii

I tried the complete absolute path of the lstmf file, but it still shows:
Continuing from jpn.lstm
Deserialize header failed: D:/new_6/text.font.exp0.lstmf
Load of page 0 failed!
Load of images failed!!

zthdsb avatar Jul 31 '19 01:07 zthdsb

These tests fail currently on a big endian machine:

  • layout_test (FAIL)
  • lstmtrainer_test (crash)
  • lstm_recode_test (crash)

One of the reasons is the non-portable (de)serialisation of class TBOX.
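
To illustrate what non-portable serialisation means in general, here is a small sketch. It uses a hypothetical box of four 16-bit coordinates; this is not Tesseract's actual TBOX layout, and it does not reproduce any fix made in Tesseract. It only shows why bytes written in one byte order are misread when interpreted in the other.

```python
import struct
import sys

# Hypothetical box with four 16-bit coordinates (not Tesseract's real layout).
box = (10, 20, 300, 400)

# Non-portable: "=" uses the byte order of the machine running the code.
native = struct.pack("=4h", *box)

# Portable: "<" always writes little endian, whatever the host is.
portable = struct.pack("<4h", *box)

# Reading little-endian bytes as big endian scrambles every value.
misread = struct.unpack(">4h", portable)

print(sys.byteorder, native == portable)
print(struct.unpack("<4h", portable))  # round-trips to the original box
print(misread)  # (2560, 5120, 11265, -28671)
```

A file written with the native form on a little endian machine and read with the native form on a big endian machine behaves like the `misread` case, which is one plausible way for a header check to fail.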

stweil avatar Mar 01 '21 16:03 stweil

Pull request #3315 fixes some parts of the problem. lstm_recode_test now works.

stweil avatar Mar 01 '21 17:03 stweil

I do not have access to a big endian machine so I can't test this.

Shreeshrii avatar Mar 25 '21 04:03 Shreeshrii

@stweil,

Maybe we should label this issue as wontfix and close it.

amitdo avatar Sep 19 '22 14:09 amitdo