Deserialize header failed: lstmf files do not work across machines with different endianness
The unit tests need some lstmf files from the repo https://github.com/tesseract-ocr/test/tree/master/testdata, which is used as a submodule in tesseract.
The unit tests run OK on ppc64le (little endian) but fail with the following error on ppc64 (big endian).
[ RUN ] LSTMTrainerTest.EncodesEng
Config file is optional, continuing...
Failed to read data from: /home/shreeshrii/langdata_lstm//eng/eng.config
Warning: given outputs 1 not equal to unicharset of 112.
Num outputs,weights in Series:
1,1,0,32:32, 0
Lbx100:200, 106400
Fc112:112, 22512
Total weights = 128912
Built network:[1,1,0,32Lbx100Fc112] from request [1,1,0,32 Lbx100 O1c1]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.01, momentum=0.9
null char=2
Deserialize header failed: /home/shreeshrii/tesseract/test/testdata/eng.Arial_Unicode_MS.exp0.lstmf
First document cannot be empty!!
num_pages_per_doc_ > 0:Error:Assert failed:in file ../../../src/ccstruct/imagedata.cpp, line 651
FAIL lstmtrainer_test (exit status: 133)
Running main() from ../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from LSTMTrainerTest
[ RUN ] LSTMTrainerTest.RecodeTestKorBase
Config file is optional, continuing...
Warning: given outputs 1 not equal to unicharset of 836.
Num outputs,weights in Series:
1,1,0,32:32, 0
Lbx96:192, 99072
Fc836:836, 161348
Total weights = 260420
Built network:[1,1,0,32Lbx96Fc836] from request [1,1,0,32 Lbx96 O1c1]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.01, momentum=0.9
null char=2
Deserialize header failed: /home/shreeshrii/tesseract/test/testdata/kor.Arial_Unicode_MS.exp0.lstmf
First document cannot be empty!!
num_pages_per_doc_ > 0:Error:Assert failed:in file ../../../src/ccstruct/imagedata.cpp, line 651
FAIL lstm_recode_test (exit status: 133)
uname -a
Linux rh-power-vm61.fit.vutbr.cz 4.16.3-301.fc28.ppc64 #1 SMP Mon Apr 23 21:44:46 UTC 2018 ppc64 ppc64 ppc64 GNU/Linux
tesseract -v
tesseract 5.0.0-alpha-319-g8dc3
leptonica-1.78.0
libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.2
They run ok on a different machine.
uname -a
Linux tesseract-ocr 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:54:50 UTC 2019 ppc64le ppc64le ppc64le GNU/Linux
tesseract -v
tesseract 5.0.0-alpha-322-g74ac
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.2.54 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.4.4 : libopenjp2 2.3.0
Running main() from ../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from LSTMTrainerTest
[ RUN ] LSTMTrainerTest.RecodeTestKorBase
Config file is optional, continuing...
Warning: given outputs 1 not equal to unicharset of 836.
Num outputs,weights in Series:
1,1,0,32:32, 0
Lbx96:192, 99072
Fc836:836, 161348
Total weights = 260420
Built network:[1,1,0,32Lbx96Fc836] from request [1,1,0,32 Lbx96 O1c1]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.01, momentum=0.9
null char=2
Loaded 464/464 lines (1-464) of document /home/ubuntu/tesseract/test/testdata/kor.Arial_Unicode_MS.exp0.lstmf
Running main() from ../../unittest/../googletest/googletest/src/gtest_main.cc
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from LSTMTrainerTest
[ RUN ] LSTMTrainerTest.EncodesEng
Config file is optional, continuing...
Failed to read data from: /home/ubuntu/langdata_lstm//eng/eng.config
Warning: given outputs 1 not equal to unicharset of 112.
Num outputs,weights in Series:
1,1,0,32:32, 0
Lbx100:200, 106400
Fc112:112, 22512
Total weights = 128912
Built network:[1,1,0,32Lbx100Fc112] from request [1,1,0,32 Lbx100 O1c1]
Training parameters:
Debug interval = 0, weights = 0.1, learning rate = 0.01, momentum=0.9
null char=2
Loaded 929/929 lines (1-929) of document /home/ubuntu/tesseract/test/testdata/eng.Arial_Unicode_MS.exp0.lstmf
Config file is optional, continuing...
Failed to read data from: /home/ubuntu/langdata_lstm//eng/eng.config
Null char=2
Warning: given outputs 1 not equal to unicharset of 111.
Num outputs,weights in Series:
1,1,0,32:32, 0
Lbx100:200, 106400
Fc111:111, 22311
...
The file sizes on both machines are the same.
ls -l /home/shreeshrii/tesseract/test/testdata/*.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii 737425 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/deu.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii 1680435 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/eng.Arial.exp0.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii 1511827 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/eng.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii 756028 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/fra.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii 866492 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/kan.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r--. 1 shreeshrii shreeshrii 979338 Jul 14 05:11 /home/shreeshrii/tesseract/test/testdata/kor.Arial_Unicode_MS.exp0.lstmf
ls -l /home/ubuntu/tesseract/test/testdata/*.lstmf
-rw-rw-r-- 1 ubuntu ubuntu 737425 Jul 9 09:57 /home/ubuntu/tesseract/test/testdata/deu.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r-- 1 ubuntu ubuntu 1680435 Jul 9 09:57 /home/ubuntu/tesseract/test/testdata/eng.Arial.exp0.lstmf
-rw-rw-r-- 1 ubuntu ubuntu 1511827 Jul 9 09:57 /home/ubuntu/tesseract/test/testdata/eng.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r-- 1 ubuntu ubuntu 756028 Jul 9 09:57 /home/ubuntu/tesseract/test/testdata/fra.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r-- 1 ubuntu ubuntu 866492 Jul 9 09:57 /home/ubuntu/tesseract/test/testdata/kan.Arial_Unicode_MS.exp0.lstmf
-rw-rw-r-- 1 ubuntu ubuntu 979338 Jul 9 09:57 /home/ubuntu/tesseract/test/testdata/kor.Arial_Unicode_MS.exp0.lstmf
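Identical file sizes are consistent with an endianness problem: the bytes on disk are the same, only their interpretation differs. A minimal Python sketch of the effect, assuming a hypothetical 32-bit count field written in little-endian order (the actual lstmf header layout is not shown in this thread, and the value 464 is only borrowed from the log above for illustration):

```python
import struct

# Hypothetical header field: a 32-bit count written by a little-endian machine.
page_count = 464
raw = struct.pack("<i", page_count)       # the four bytes as stored in the file

# The same four bytes, interpreted with each byte order.
as_little = struct.unpack("<i", raw)[0]   # how a little-endian host reads them natively
as_big = struct.unpack(">i", raw)[0]      # how a big-endian host reads them natively

print(len(raw), as_little, as_big)
```

The byte count is unchanged, yet the big-endian reading is a different (here negative) number, which is the kind of value that could make a header check fail.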
lstmtraining --model_output="C:\Users\zhangtiehai\Desktop\new_6\output" --continue_from jpn.lstm --train_listfile=text.training_file.txt --traineddata="C:\Users\zhangtiehai\Desktop\new_6\jpn.traineddata" -max_iterations 800 -U text.unicharset
Loaded file jpn.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from jpn.lstm
Deserialize header failed: !'+,-01456789:?BCEHLSTUacdefilmnoprstuy [long run of mis-encoded characters]
Load of page 0 failed!
Load of images failed!!
Excuse me, why does this happen and how can I solve the problem?
Please share the files you used for testing. Thanks!
!'+,-01456789:?BCEHLSTUacdefilmnoprstuy―…→、。々〇《》「」『』【】ぁあぃいうぇえぉおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろわをん゛ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチッツテデトドナニネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロワンヴヶー一丁七万丈三上下不与世丘両並中串丸丹主久之乏乖乗乙九乞也乱乳乾亀了予争事二云互五井些亡亢交享京亭人仁仇今介仏仔仕他付代令以仮仰仲件任企伊伍伏伐休会伝伯伴伸伺似伽佇位低住佐体何余作佳併使例侍供依価侮侵侶便係促俊俗保信俤修俯俺倅倉個倍倒倖候借倣値倦偉偏停健側偵偶偽傀傍傘備催傲傷傾僅働像僕僚僥僧僭僻儀億儘儚償儡優儲元兄充兆先光克免兎児党入全八公六共兵其具典兼内円冊再冒冗写冠冥冬冴冷凄凋凍凝几凡処凶凸凹出函刀刃分切刈刊刑列初判別利到刳制刷券刹刺刻則削剌前剖剛剣剤剥副剰割創劇力功加劣助努劫励労効劾勇勉動勘務勝募勢勤勧勾勿匂包化北匹区医匿十千升午半卑卒卓協南単博占印危即却卵卸厄厚原厠厨厭厳去参叉及友双反収叔取受叙叟叢口古句叩只叫召可台叱史右叶号司各合吉吊同名吐向君吝吟吠否含吸吹吼吾呂呆呈告呑呟周呪味呵呷呻呼命咄和咎咤咬咲咳咽哀品哄哉員哭哲唆唇唐唖唯唱唸唾商問啓啖啜啣啼喀善喉喋喘喚喜喝喧喩喪喫喰営嗄嗅嗇嗚嗜嗟嗤嘆嘔嘖嘘嘩嘲嘴噂噌噛噤器噪噴嚇嚥囁囃囓囚四回因団囮困囲図固国圃圏園土圧在地坂均坊坐坦垂型垢垣埃埋城埒埜域執培基堀堂堅堆堕堙堡堪堰報場堵堺塀塊塔塗塚塞塩填塵塹塾境墓増墜墟墨壁壇壊壌壕士壮声売壺変夏夕外多夜夢大天太夫央失奇奈奉奏契奔套奢奥奪奮女奴好如妄妙妥妨妬妹妻姉始姑姓委姪姻姿威娘娯娶婆婚婦婿媒媚嫁嫉嫌嬉嬢子孔孕字存孝季孤学孫宅守安完宗官宙定宛宜宝実客宣室宥宮害宴宵家容宿寂寄密富寒寛寝寞察寡寧審寮寵寸寺対寿封専射将尊尋導小少尖就尻尼尽尾尿局屁居屈届屋屍屑屓展属屠層履山屹岐岡岩岳岸峙峠島峻崇崎崖崩嵌嵩嶋巌川州巡巣工左巨巫差己巳巻巾市布希帝師席帯帰帳帷常帽幄幅幕幡幢幣干平年幸幹幻幼幽幾広庇床序底店庚度座庫庭庶康庸廂廃廊延建廻弁弄弊式弓弔引弘弛弟弥弧弱張強弾当形彩彫影彷役彼往径待徊律後徐徒従得徘御徨復循微徳徴徹心必忌忍忖志忘忙応忠快念忽忿怒怖思怠急性怨怪怯怺恃恋恐恒恣恥恨恩恫恭息恰恵悄悍悔悖悟悠患悦悩悪悲悴悶悸悼情惑惚惜惧惨惰想惹愁愉愍意愕愚愛感愧愴慄慇慈態慌慎慕慚慟慢慣慨慮慰慴憂憊憎憐憑憔憚憤憧憩憫憮憶憺懃懇懊懐懣懦懲懸成我戒戚戦截戮戯戸戻房所扁扇扈扉手
train_listfile=text.training_file.txt
This needs to be a text file with a list of lstmf files which are used for training.
It seems to me that you have put a list of all characters in it. That text is not suitable even as a training text to be used for generating the lstmf files.
Please see the page regarding training in the wiki.
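A quick way to catch this mistake before starting a training run is to check that every line of the list file names an existing .lstmf file. This is only a sketch: `check_train_listfile` is a hypothetical helper, not part of Tesseract, and it assumes relative entries are meant relative to the current working directory.

```python
from pathlib import Path

def check_train_listfile(listfile):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    lines = Path(listfile).read_text(encoding="utf-8").splitlines()
    for number, line in enumerate(lines, start=1):
        entry = line.strip()
        if not entry:
            continue  # skip blank lines, e.g. one at the end of the file
        if not entry.endswith(".lstmf"):
            # A line of raw characters (as in the file posted above) lands here.
            problems.append(f"line {number}: does not name an .lstmf file: {entry[:30]!r}")
        elif not Path(entry).is_file():
            # Relative entries are checked against the current working directory.
            problems.append(f"line {number}: file not found: {entry}")
    return problems
```

Calling `check_train_listfile("text.training_file.txt")` on a file that contains a character list instead of paths would report every non-blank line.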
Could you give me an example of a testing file? Thanks very much.
See https://github.com/tesseract-ocr/langdata/blob/master/jpn/jpn.training_text for Training text sample.
See https://github.com/Shreeshrii/tess4tutorial/blob/master/trainlayer/eng.training_files.txt for sample of file to be given for --train_listfile
How do I create the .lstmf file?
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#using-tesstrainsh
OMP_THREAD_LIMIT=1 tesseract $my_file ${my_file%.*} -l jpn --psm 6 lstm.train
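For anyone scripting this step outside a Unix shell (for example on Windows), the same call can be assembled in Python. This is a sketch under assumptions: `lstm_train_command` and `lstm_train_env` are hypothetical helper names, and the arguments are simply the ones from the command above.

```python
import os
from pathlib import Path

def lstm_train_command(image_path, lang="jpn", psm=6):
    """Build the argument list for the tesseract call shown above."""
    image = Path(image_path)
    # Dropping the last extension mirrors the shell expansion ${my_file%.*}.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base),
            "-l", lang, "--psm", str(psm), "lstm.train"]

def lstm_train_env():
    """Copy the current environment and limit OpenMP to one thread."""
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = "1"
    return env

print(lstm_train_command("text.font.exp0.tif"))
```

The two results could then be passed to `subprocess.run(command, env=lstm_train_env(), check=True)` once per image.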
e3 ffffff83 ffffff96 ffffffe3 ffffff83 ffffff97 ffffffe3 ffffff83 ffffff98 ffffffe3 ffffff83 ffffff99 ffffffe3 ffffff83 ffffff9a ffffffe3 ffffff83 ffffff9b ffffffe3 ffffff83 ffffff9c ffffffe3 ffffff83 ffffff9d ffffffe3 ffffff83 ffffff9e ffffffe3 ffffff83 ffffff9f ffffffe3 ffffff83 ffffffa0 ffffffe3 ffffff83 ffffffa1 ffffffe3 ffffff83 ffffffa2 ffffffe3 ffffff83 ffffffa3 ffffffe3 ffffff83 ffffffa4 ffffffe3 ffffff83 ffffffa5 ffffffe3 ffffff83 ffffffa6 ffffffe3 ffffff83 ffffffa7 ffffffe3 ffffff83 ffffffa8 ffffffe3 ffffff83 ffffffa9 ffffffe3 ffffff83 ffffffaa ffffffe3 ffffff83 ffffffab ffffffe3 ffffff83 ffffffac ffffffe3 ffffff83 ffffffad ffffffe3 ffffff83 ffffffaf ffffffe3 ffffff83 ffffffb3 ffffffe3 ffffff83 ffffffb4 ffffffe3 ffffff83 ffffffb6 ffffffe3 ffffff83 ffffffbc ffffffe4 ffffffb8 ffffff80 ffffffe4 ffffffb8 ffffff81 ffffffe4 ffffffb8 ffffff83 ffffffe4 ffffffb8 ffffff87 ffffffe4 ffffffb8 ffffff88 ffffffe4 ffffffb8 ffffff89 ffffffe4 ffffffb8 ffffff8a ffffffe4 ffffffb8 ffffff8b ffffffe4 ffffffb8 ffffff8d ffffffe4 ffffffb8 ffffff8e ffffffe4 ffffffb8 ffffff96 ffffffe4 ffffffb8 ffffff98 ffffffe4 ffffffb8 ffffffa1 ffffffe4 ffffffb8 ffffffa6 ffffffe4 ffffffb8 ffffffad ffffffe4 ffffffb8 ffffffb2 ffffffe4 ffffffb8 ffffffb8 ffffffe4 ffffffb8 ffffffb9 ffffffe4 ffffffb8 ffffffbb ffffffe4 ffffffb9 ffffff85 ffffffe4 ffffffb9 ffffff8b ffffffe4 ffffffb9 ffffff8f ffffffe4 ffffffb9 ffffff96 ffffffe4 ffffffb9 ffffff97 ffffffe4 ffffffb9 ffffff99 ffffffe4 ffffffb9 ffffff9d ffffffe4 ffffffb9 ffffff9e ffffffe4 ffffffb9 ffffff9f ffffffe4 ffffffb9 ffffffb1 ffffffe4 ffffffb9 ffffffb3 ffffffe4 ffffffb9 ffffffbe ffffffe4 ffffffba ffffff80 ffffffe4 ffffffba ffffff86 ffffffe4 ffffffba ffffff88 ffffffe4 ffffffba ffffff89 ffffffe4 ffffffba ffffff8b ffffffe4 ffffffba ffffff8c ffffffe4 ffffffba ffffff91 ffffffe4 ffffffba ffffff92 ffffffe4 ffffffba ffffff94 ffffffe4 ffffffba ffffff95 ffffffe4 ffffffba ffffff9b ffffffe4 ffffffba ffffffa1 ffffffe4 ffffffba ffffffa2 ffffffe4 ffffffba ffffffa4 
ffffffe4 ffffffba ffffffab ffffffe4 ffffffba ffffffac ffffffe4 ffffffba ffffffad ffffffe4 ffffffba ffffffba ffffffe4 ffffffbb ffffff81 ffffffe4 ffffffbb ffffff87 ffffffe4 ffffffbb ffffff8a ffffffe4 ffffffbb ffffff8b ffffffe4 ffffffbb ffffff8f ffffffe4 ffffffbb ffffff94 ffffffe4 ffffffbb ffffff95 ffffffe4 ffffffbb
Excuse me, the screen shows this output. Is it training?
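The values in this dump look like UTF-8 bytes printed as sign-extended integers (ffffff83 is the byte 0x83). A small Python sketch to decode such a dump, where `decode_dump` is a hypothetical helper written for this thread:

```python
def decode_dump(dump):
    """Decode space-separated hex values, keeping only the low 8 bits of each."""
    data = bytes(int(token, 16) & 0xFF for token in dump.split())
    # errors="replace" keeps going if the dump is cut off mid-character.
    return data.decode("utf-8", errors="replace")

# First nine values of the dump above.
sample = "e3 ffffff83 ffffff96 ffffffe3 ffffff83 ffffff97 ffffffe3 ffffff83 ffffff98"
print(decode_dump(sample))
```

The sample decodes to ブプヘ, which appear in that order in the character list posted earlier in this thread. That suggests the screen is echoing text data rather than reporting training iterations, though this reading is an inference and is not confirmed anywhere in the thread.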
Excuse me, the file "text.font.exp0.lstmf" was generated by the command:
tesseract text.font.exp0.tif text.font.exp0 -l jpn lstm.train
The content of text.training_file.txt is "../new_6/text.font.exp0.lstmf".
Then the command is:
lstmtraining --model_output="D:\new_6\output" --continue_from jpn.lstm --train_listfile=text.training_file.txt --traineddata="D:\new_6\jpn.traineddata" -max_iterations 5
Is that right? What encoding should text.training_file.txt use?
ls -1 ../new_6/text.font.exp0.lstmf > text.training_file.txt
should create the file in the correct format. I think it is a UTF-8 file with Unix EOL and a blank line at the end.
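Where `ls` is not available (for example in a Windows command prompt), a file with the same properties can be written from Python. A minimal sketch, assuming a hypothetical helper `write_train_listfile`; writing absolute paths with forward slashes is a choice made here so the result does not depend on the working directory, not a documented requirement.

```python
from pathlib import Path

def write_train_listfile(lstmf_paths, listfile):
    """Write one absolute lstmf path per line, UTF-8, Unix line endings."""
    lines = [Path(p).resolve().as_posix() for p in lstmf_paths]
    # newline="\n" prevents Python from writing \r\n on Windows.
    with open(listfile, "w", encoding="utf-8", newline="\n") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines
```

For the case discussed here, the call would be `write_train_listfile(["../new_6/text.font.exp0.lstmf"], "text.training_file.txt")`.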
lstmtraining --model_output="D:\new_6\output" --continue_from jpn.lstm --train_listfile=text.training_file.txt --traineddata="D:\new_6\jpn.traineddata" -max_iterations 5
Loaded file jpn.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Continuing from jpn.lstm
Deserialize header failed: ../new_6/text.font.exp0.lstmf
Load of page 0 failed!
Load of images failed!!
Please check the location of ../new_6/text.font.exp0.lstmf since you have given a relative path.
Try with complete absolute path of your lstmf file.
I tried again with the complete absolute path of the lstmf file, but it still shows:
Continuing from jpn.lstm
Deserialize header failed: D:/new_6/text.font.exp0.lstmf
Load of page 0 failed!
Load of images failed!!
These tests fail currently on a big endian machine:
- layout_test (FAIL)
- lstmtrainer_test (crash)
- lstm_recode_test (crash)
One of the reasons is the non-portable (de)serialisation of class TBOX.
Pull request #3315 fixes some parts of the problem. lstm_recode_test now works.
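To illustrate what portable (de)serialisation means in general, here is a Python sketch that always writes a box in a fixed little-endian layout, so every host reads the same values. It is not Tesseract's code and not necessarily what the pull request does; the layout of four signed 16-bit coordinates and the function names are assumptions made for the example.

```python
import struct

# Fixed on-disk layout: four signed 16-bit coordinates, always little-endian.
# "<" pins the byte order regardless of the host CPU.
BOX_FORMAT = "<4h"

def serialize_box(left, bottom, right, top):
    """Pack the coordinates into 8 bytes with an explicit byte order."""
    return struct.pack(BOX_FORMAT, left, bottom, right, top)

def deserialize_box(data):
    """Unpack 8 bytes written by serialize_box on any machine."""
    return struct.unpack(BOX_FORMAT, data)

encoded = serialize_box(1, 2, 300, 400)
print(encoded.hex(), deserialize_box(encoded))
```

Dumping the in-memory struct as-is would instead bake the writer's native byte order into the file, which is the non-portable case described above.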
I do not have access to a big endian machine so I can't test this.
@stweil,
Maybe we should label this issue as wontfix and close it.