
LSTM: Training - Error msg - Encoding of string failed!

Open Shreeshrii opened this issue 9 years ago • 38 comments

$   training/lstmtraining --model_output ~/tesstutorial/sanskrit2003_from_full/sanskrit2003 \
>   --continue_from ~/tesstutorial/sanskrit2003_from_full/san.lstm \
>   --train_listfile ~/tesstutorial/santrain/san.training_files.txt \
>   --target_error_rate 0.01
Loaded file /home/shree/tesstutorial/sanskrit2003_from_full/sanskrit2003_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/sanskrit2003_from_full/sanskrit2003_checkpoint
Loaded 1746/1746 pages (0-1746) of document /home/shree/tesstutorial/santrain/san.Chandas.exp0.lstmf
Loaded 345/1760 pages (1415-1760) of document /home/shree/tesstutorial/santrain/san.Uttara.exp0.lstmf
Loaded 1814/1814 pages (0-1814) of document /home/shree/tesstutorial/santrain/san.Gargi.exp0.lstmf
Found AVX
Found SSE
At iteration 1808/17200/17229, Mean rms=0.336%, delta=0.129%, char train=0.41%, word train=1.751%, skip ratio=0.2%,  New worst char error = 0.41 wrote checkpoint.

Encoding of string failed! Failure bytes: ffffffc2 ffffffa3 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5
Can't encode transcription: व्यतर्कि १४. भवति ३७॥ £ सर्व्व
At iteration 1818/17300/17330, Mean rms=0.334%, delta=0.13%, char train=0.404%, word train=1.632%, skip ratio=0.3%,  wrote checkpoint.


Shreeshrii avatar Dec 09 '16 04:12 Shreeshrii

Still getting the errors with the following version -


 tesseract -v
tesseract 4.00.00alpha-219-gc124f87
 leptonica-1.74
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8


Can't encode transcription: सगुनल उठैलका देउता नेउता लवरना लोहमान कुदार
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 fffff
fa4 ffffffb9 ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa
4 ffffffbe ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffff85 ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4
ffffffb8 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff80 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ff
ffffac ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbe
Can't encode transcription: बिसहरी सड़िया हड़िया लादना अधसेरी सुबुकना
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8 20 ffffffe0 fffff
fa4 ffffffac ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffbf 20 ffffffe0 ffffffa
4 ffffff97 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4
ffffffb6 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff87 20 ffffffe0 ffffffa4 ff
ffffb8 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffa7 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffff
ff9c ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffffa4 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb0 20 ffffffe0 ffffffa4 ffffff
a8 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffff97 ffffffe0 ffffffa5 ffffff81 ffffffe0 ffffffa4 ffffffa8 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ff
ffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffff81
Can't encode transcription: चूड़ियन बुद्धि गुप्ता शासनमे सुद्धा जँतसार निगुनियाँ
Encoding of string failed! Failure bytes: ffffffe0 ffffffa5 ffffff9c ffffffe0 ffffffa4 ffffff87 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff82 ffffffe0 ffffffa4
 ffffff81 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffac ffffffe0 ffffffa
5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4 ffffffbe 20 ffffffe0 ffffffa4 ffffffae ffffffe0 ffffffa5 ffffff8b ffffffe0 ffffffa4 ffffffa5 ffffffe0 ffffffa4
ffffffbe 20 ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa5 ffffff87 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ff
ffff8d ffffffe0 ffffffa4 ffffff9b ffffffe0 ffffffa4 ffffffbe ffffffe0 ffffffa4 ffffffb8 ffffffe0 ffffffa4 ffffff81 20 ffffffe0 ffffffa4 ffffffaa ffffffe0 ffffffa4 ffff
ffbe ffffffe0 ffffffa4 ffffffb0 ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9f ffffffe0 ffffffa5 ffffff80 20 ffffffe0 ffffffa4 ffffffb2 ffffffe0 ffffffa5 ffffff
9c ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa4 ffffffbf ffffffe0 ffffffa4 ffffffaf ffffffe0 ffffffa4 ffffffa8
Can't encode transcription: दौड़इलूँ पोथा बोथा मोथा स्वेच्छासँ पार्टी लड़कियन

Shreeshrii avatar Dec 28 '16 04:12 Shreeshrii

Also seen when fine-tuning Arabic:


lstmtraining --model_output ~/tesstutorial/aratuned_from_ara/aratuned   --continue_from ~/tesstutorial/aratuned_from_ara/ara.lstm   --train_listfile ~/tesstutorial/ara/ara.training_files.txt     --eval_listfile ~/tesstutorial/aratest/ara.training_files.txt   --target_error_rate 0.0001
Loaded file /home/shree/tesstutorial/aratuned_from_ara/aratuned_checkpoint, unpacking...
Successfully restored trainer from /home/shree/tesstutorial/aratuned_from_ara/aratuned_checkpoint
Loaded 229/229 pages (1-229) of document /home/shree/tesstutorial/ara/ara.Amiri.exp0.lstmf
Loaded 232/232 pages (1-232) of document /home/shree/tesstutorial/ara/ara.Arial.exp0.lstmf
Loaded 4/4 pages (1-4) of document /home/shree/tesstutorial/aratest/ara.Times_New_Roman.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffd9 ffffff8e ffffffd9 ffffff8a ffffffd9 ffffff82 ffffffd9 ffffff90 ffffffd8 ffffffaf ffffffd9 ffffff90 ffffffd8 ffffffa7
 ffffffd8 ffffffb5 ffffffd9 ffffff8e 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd8 ffffffaa ffffffd9 ffffff8f ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd9 ffffff83 f
fffffd9 ffffff8f 20 ffffffd9 ffffff86 ffffffd9 ffffff92 ffffffd8 ffffffa5 ffffffd9 ffffff90 20 ffffffd8 ffffffa7 ffffffd9 ffffff84 ffffffd9 ffffff84 ffffffd9 ffffff91
ffffffd9 ffffff8e ffffffd9 ffffff87 ffffffd9 ffffff90 20 ffffffd9 ffffff86 ffffffd9 ffffff90 ffffffd9 ffffff88 ffffffd8 ffffffaf ffffffd9 ffffff8f 20 ffffffd9 ffffff86
 ffffffd9 ffffff92 ffffffd9 ffffff85 ffffffd9 ffffff90 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff83 ffffffd9 ffffff8f ffffffd8 ffffffa1 ffffffd9 ffffff8e f
fffffd8 ffffffa7 ffffffd8 ffffffaf ffffffd9 ffffff8e ffffffd9 ffffff87 ffffffd9 ffffff8e ffffffd8 ffffffb4 ffffffd9 ffffff8f
Can't encode transcription: نَيقِدِاصَ مْتُنْكُ نْإِ اللَّهِ نِودُ نْمِ مْكُءَادَهَشُ
Loaded 231/231 pages (1-231) of document /home/shree/tesstutorial/ara/ara.Arial_Unicode_MS.exp0.lstmf
Encoding of string failed! Failure bytes: ffffffd9 ffffff8e ffffffd9 ffffff88 ffffffd8 ffffffb1 ffffffd9 ffffff8f ffffffd8 ffffffb5 ffffffd9 ffffff90 ffffffd8 ffffffa8
 ffffffd9 ffffff92 ffffffd9 ffffff8a ffffffd9 ffffff8f 20 ffffffd9 ffffff84 ffffffd9 ffffff8e ffffffd8 ffffffa7 20 ffffffd8 ffffffaa ffffffd9 ffffff8d ffffffd8 ffffffa
7 ffffffd9 ffffff85 ffffffd9 ffffff8e ffffffd9 ffffff84 ffffffd9 ffffff8f ffffffd8 ffffffb8 ffffffd9 ffffff8f 20 ffffffd9 ffffff8a ffffffd9 ffffff81 ffffffd9 ffffff90
20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff8f ffffffd9 ffffff83 ffffffd9 ffffff8e ffffffd8 ffffffb1 ffffffd9 ffffff8e ffffffd8 ffffffaa ff
ffffd9 ffffff8e ffffffd9 ffffff88 ffffffd9 ffffff8e 20 ffffffd9 ffffff85 ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff90 ffffffd8 ffffffb1 ffffffd9 ffffff90 ffff
ffd9 ffffff88 ffffffd9 ffffff86 ffffffd9 ffffff8f ffffffd8 ffffffa8 ffffffd9 ffffff90
Can't encode transcription: نَورُصِبْيُ لَا تٍامَلُظُ يفِ مْهُكَرَتَوَ مْهِرِونُبِ
Encoding of string failed! Failure bytes: ffffffd9 ffffff92 ffffffd9 ffffff87 ffffffd9 ffffff90 

Shreeshrii avatar Dec 31 '16 06:12 Shreeshrii

See new section in trainingtesseract-4.00

theraysmith avatar Jan 11 '17 23:01 theraysmith

The wiki does not seem to have this section:

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

TrainingTesseract 4.00 Stefan Weil edited this page 28 days ago · 9 revisions

We have a GitHub outage in India just now. I am not sure whether this is related to that, or whether the wiki update is still on the to-do list.


Shreeshrii avatar Jan 12 '17 08:01 Shreeshrii

It is working correctly in Spain. Thank you all for the incredible amount of work you have done.

Brian51 avatar Jan 12 '17 09:01 Brian51

I don't see the changes either.

The wiki can be cloned as a git repo. Ray probably did some edits locally, but didn't 'push' them yet.

amitdo avatar Jan 12 '17 10:01 amitdo

Changes are pushed now. I got called away yesterday before I was able to do it.


theraysmith avatar Jan 12 '17 17:01 theraysmith


Encoding of string failed! Failure bytes: 9 31 32 30 30 45 6d 69 6c 69 65 2c 68 61 6e 73 4b 6f 6e 65 2e
Can't encode transcription: Møller.     1200Emilie,hansKone.

when trying to train frk

Shreeshrii avatar Jan 21 '17 14:01 Shreeshrii

The tab character (9) at the beginning of the list of failure bytes is a dead giveaway.
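
A quick way to catch such stray characters before training is to scan the training text for control characters. The following is only a sketch; the helper name `find_control_chars` is hypothetical and not part of Tesseract.

```python
import unicodedata


def find_control_chars(text):
    """Yield (line_number, column, code_point) for each control character.

    Line breaks are treated as separators, so they are not reported.
    """
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            # Category "Cc" covers tab and the other C0/C1 control characters.
            if unicodedata.category(ch) == "Cc":
                yield line_no, col, f"U+{ord(ch):04X}"


# The transcription from the frk report, with the tab written explicitly.
sample = "M\u00f8ller.\t1200Emilie,hansKone."
for hit in find_control_chars(sample):
    print(hit)
```

Any hit points at a line of the training text that should be cleaned before the .lstmf files are regenerated.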


theraysmith avatar Jan 23 '17 19:01 theraysmith

@Shreeshrii Is this issue resolved? I'm getting the same error when training the Telugu language.

harinath141 avatar Feb 02 '17 08:02 harinath141

Please see https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training

Encoding of string failed! results when the text string for a training image 
cannot be encoded using the given unicharset. 

Possible causes are:

- There is an un-represented character in the text, say a British Pound sign that is not in your unicharset.

- A stray unprintable character (like tab or a control character) in the text.

- There is an un-represented Indic grapheme/aksara in the text.

In any case it will result in that training image being ignored by the trainer. 

If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.
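
To see which of these causes applies, one can check a transcription against the unicharset outside the trainer. The sketch below is an approximation under stated assumptions: it treats the first space-separated field of each unicharset line as the unichar, assumes the `NULL` entry stands for the space character, and ignores any cleanup that Tesseract's own encoder performs. The function names are hypothetical.

```python
def load_unichars(path):
    """Read the unichar strings (first field of each line) from a unicharset file."""
    unichars = set()
    with open(path, encoding="utf-8") as f:
        next(f, None)  # the first line holds the number of entries
        for line in f:
            token = line.split(" ", 1)[0].strip()
            if not token:
                continue
            # Assumption: the entry written as "NULL" represents the space character.
            unichars.add(" " if token == "NULL" else token)
    return unichars


def first_unencodable(text, unichars):
    """Return None if text can be split into unichars, else the index where it gets stuck."""
    max_len = max(map(len, unichars), default=0)
    # reachable[i] is True when text[:i] can be written as a sequence of unichars.
    reachable = [True] + [False] * len(text)
    for i in range(len(text)):
        if not reachable[i]:
            continue
        for j in range(i + 1, min(len(text), i + max_len) + 1):
            if text[i:j] in unichars:
                reachable[j] = True
    if reachable[-1]:
        return None
    return max(i for i, ok in enumerate(reachable) if ok)


demo = {" ", "a", "b", "ab"}
print(first_unencodable("ab a", demo))  # every piece is in the set
print(first_unencodable("a\tb", demo))  # gets stuck at the tab
```

The returned index marks the furthest point that could still be encoded, which is usually where the un-represented character or akshara begins.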

Shreeshrii avatar Feb 02 '17 09:02 Shreeshrii

@harinath141 If you are getting a lot of these errors during fine-tuning, try replace-top-layer training instead. You can reuse the box/tiff pairs generated for fine-tuning. The commands will be similar to the following:

mkdir -p ~/tesstutorial/tellayer_from_tel 

combine_tessdata -e ../tessdata/tel.traineddata \
  ~/tesstutorial/tellayer_from_tel/tel.lstm
  
lstmtraining -U ~/tesstutorial/tel/tel.unicharset \
  --script_dir ../langdata  --debug_interval 0 \
  --continue_from ~/tesstutorial/tellayer_from_tel/tel.lstm \
  --append_index 5 --net_spec '[Lfx256 O1c105]' \
  --model_output ~/tesstutorial/tellayer_from_tel/tellayer \
  --train_listfile ~/tesstutorial/tel/tel.training_files.txt \
  --target_error_rate 0.01

Shreeshrii avatar Feb 02 '17 13:02 Shreeshrii

~/tesstutorial/tel/ should have your .lstmf files.

Shreeshrii avatar Feb 02 '17 14:02 Shreeshrii

Thank you @Shreeshrii, I'll try replacing the top layer.

harinath141 avatar Feb 02 '17 14:02 harinath141

@harinath141

When you use --debug_interval 0 you will see messages every 100 iterations like the following:

At iteration 45909/58500/58569, Mean rms=0.639%, delta=0.621%, char train=1.861%, word train=13.302%, skip ratio=0%,  wrote checkpoint.

At iteration 45960/58600/58669, Mean rms=0.64%, delta=0.616%, char train=1.844%, word train=12.933%, skip ratio=0%,  wrote checkpoint.

2 Percent improvement time=14052, best error was 3.697 @ 31958
At iteration 46010/58700/58769, Mean rms=0.634%, delta=0.561%, char train=1.686%, word train=12.343%, skip ratio=0%,  New best char error = 1.686 wrote best model:/home/shree/tesstutorial/khmlayer1_from_khm/khm1.686_46010.lstm wrote checkpoint.
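
If you redirect the training output to a file, summary lines like the ones above can be parsed to follow the error rates over time. This is a minimal sketch that assumes the line format shown here; the group names for the three iteration counters are my own labels, not names taken from the Tesseract source.

```python
import re

PROGRESS_RE = re.compile(
    r"At iteration (?P<learning>\d+)/(?P<training>\d+)/(?P<sample>\d+), "
    r"Mean rms=(?P<rms>[\d.]+)%, delta=(?P<delta>[\d.]+)%, "
    r"char train=(?P<char_err>[\d.]+)%, word train=(?P<word_err>[\d.]+)%, "
    r"skip ratio=(?P<skip>[\d.]+)%"
)


def parse_progress(line):
    """Return a dict of the numbers in an 'At iteration' line, or None for other lines."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    d = m.groupdict()
    return {
        "iterations": tuple(int(d[k]) for k in ("learning", "training", "sample")),
        "rms": float(d["rms"]),
        "delta": float(d["delta"]),
        "char_err": float(d["char_err"]),
        "word_err": float(d["word_err"]),
        "skip_ratio": float(d["skip"]),
    }


line = ("At iteration 45909/58500/58569, Mean rms=0.639%, delta=0.621%, "
        "char train=1.861%, word train=13.302%, skip ratio=0%,  wrote checkpoint.")
print(parse_progress(line))
```

A rising skip ratio in these lines is worth watching, since training images that cannot be encoded are skipped by the trainer.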

When you use --debug_interval -1 , messages such as the following will be shown for every iteration:


Iteration 59400: ALIGNED TRUTH : មានរូបឆ្មាំ អេស៊ីលីដា
Iteration 59400: BEST OCR TEXT : មានរូបឆ្មាំ អេស៊ីលីដា
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Noto_Serif_Khmer_Bold.exp0.lstmf page 53 (Perfect):
Mean rms=0.646%, delta=0.553%, train=1.878%(13.168%), skip ratio=0.1%
Iteration 59401: ALIGNED TRUTH : ឆ្កៀលយកភ្នែក ជួនឆ្លងវគ្គ ចាប់ពីពេលនោះមក របស់គាត់ កុំធេ្វសគំនិត។ អូនហ្អើយ =
Iteration 59401: BEST OCR TEXT : ឆ្លៀលយកភ្នែក ជួនឆ្លងវគត ចាប់ពីពេលនោះមក របស់គាត់ កុំធេ្វសគំនិត។ អូនហ្អើយ =
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Noto_Serif_Khmer.exp0.lstmf page 1 :
Mean rms=0.647%, delta=0.555%, train=1.881%(13.157%), skip ratio=0.1%
Iteration 59402: ALIGNED TRUTH : សឹងមានះរឹងត្អឹងមហិមា គុណ នៅប៉ែកឦសាននៃភ្នំ ទុលល្យូ ខេត្តស្ទឺងត្រែង,
Iteration 59402: BEST OCR TEXT : សឹងមានះរឹងត្អឹងមហិមា គុណ នៅប៉ែកឦសាននៃភ្នំ ទុលល្យូ ខេត្តស្ទឺងត្រែង,
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Leelawadee_UI_Bold.exp0.lstmf page 56 :
Mean rms=0.647%, delta=0.556%, train=1.881%(13.157%), skip ratio=0.1%
Iteration 59403: ALIGNED TRUTH : រឺគៃបន្លំបាន។ (រឿងអាខ្វាក់អាខ្វិន) អន្នំលោកង្សិ = ឧទាហរណ៍់៖តំបន់ខ្លះ ផ្ទះសម្បែង
Iteration 59403: BEST OCR TEXT : រឺគៃបន្លំបាន។ (រឿងអាខ្វាក់អាខ្វិន) អន្នំលោកង្សិ = ឧទាហរណ៍៖តំបន់ខ្លះ ផ្ទះសម្បែង
File /tmp/tmp.BjsuuQ0dgJ/khm/khm.Leelawadee_UI.exp0.lstmf page 51 :

Intermediate checkpoint and .lstm files will be written to the output directory, e.g. ~/tesstutorial/tellayer_from_tel. You can also see visual debugging output with ScrollView.

Shreeshrii avatar Feb 03 '17 05:02 Shreeshrii

@theraysmith

I am still getting this error in a new replace-top-layer training run for Devanagari script, where the eval_listfile is based on a different training text, e.g.:

Encoding of string failed! Failure bytes: ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff8d ffffffe0 ffffffa4 ffffff9a ffffffe0 ffffffa5 ffffff88 ffffffe0 ffffffa4 ffffff95 ffffffe0 ffffffa5 ffffff8b 20 ffffffe0 ffffffa4 ffffff9c ffffffe0 ffffffa5 ffffff80 ffffffe0 ffffffa4 ffffffb5 ffffffe0 ffffffa4 ffffffa8
Can't encode transcription: वैशाख साल देखि साथै यो साँच्चैको जीवन

Encoding of string failed! Failure bytes: ffffffe0 ffffffa4 ffffff81 ffffffe0 ffffffa4 ffffffa6 ffffffe0 ffffffa4 ffffffbe
Can't encode transcription: रूपांतरित जैबुन्निसा केंद्रित छँदा

While each unicode character (स ा ँ ) is there in the Devanagari unicharset, the combined akshara (साँ, छँ) is not there as part of training text/unicharset, but is there as part of eval text/unicharset.

The training unicharset is of the following format:

3784
NULL 0 NULL 0
Joined 7 0,69,188,255,486,1218,0,30,486,1188 Latin 1 0 1 Joined	# Joined [4a 6f 69 6e 65 64 ]a
|Broken|0|1 f 0,69,186,255,892,2138,0,80,892,2058 Common 3625 10 3625 |Broken|0|1	# Broken
र्ध्रु 1 0,64,61,197,280,356,0,0,280,356 Devanagari 18 0 18 र्ध्रु	# र्ध्रु [930 94d 927 94d 930 941 ]x
र्बृ 1 3,64,61,197,181,236,0,0,181,236 Devanagari 18 0 18 र्बृ	# र्बृ [930 94d 92c 943 ]x
श्चु 1 0,64,61,197,251,303,0,12,251,291 Devanagari 240 0 240 श्चु	# श्चु [936 94d 91a 941 ]x
श्चौ 1 3,65,61,255,294,367,0,12,294,355 Devanagari 240 0 240 श्चौ	# श्चौ [936 94d 91a 94c ]x
श्च् 1 3,64,61,197,251,303,0,12,251,291 Devanagari 240 0 240 श्च्	# श्च् [936 94d 91a 94d ]x
य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 8 0 8 य	# य [92f ]x
श्रीः 1 3,74,61,253,295,412,0,12,295,400 Devanagari 240 0 240 श्रीः	# श्रीः [936 94d 930 940 903 ]x
ष्ठु 1 0,75,61,197,204,243,0,0,204,243 Devanagari 241 0 241 ष्ठु	# ष्ठु [937 94d 920 941 ]x
ष्ठौ 1 3,75,61,255,247,307,0,0,247,307 Devanagari 241 0 241 ष्ठौ	# ष्ठौ [937 94d 920 94c ]x
स्रैः 1 3,76,61,255,243,449,0,0,243,449 Devanagari 280 0 280 स्रैः	# स्रैः [938 94d 930 948 903 ]x
...

Does this mean that the training text needs to be expanded to include all possible akshara combinations?
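
To find out which combinations are missing, the aksharas in the eval text can be compared with those in the training text. The sketch below uses a deliberately rough segmentation (a base character plus following combining marks, joined across the Devanagari virama U+094D); it is an assumption for illustration and not the grapheme logic Tesseract uses. The function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama; other scripts use other code points


def aksharas(text):
    """Split text into approximate aksharas, word by word."""
    clusters = []
    for word in text.split():
        current = ""
        for ch in word:
            is_mark = unicodedata.category(ch).startswith("M")
            joins_conjunct = current.endswith(VIRAMA) and ch.isalpha()
            if current and (is_mark or joins_conjunct):
                current += ch
            else:
                if current:
                    clusters.append(current)
                current = ch
        if current:
            clusters.append(current)
    return clusters


def missing_aksharas(train_text, eval_text):
    """Return the aksharas that occur in eval_text but never in train_text."""
    return sorted(set(aksharas(eval_text)) - set(aksharas(train_text)))


# U+0938 U+093E versus U+0938 U+093E U+0901 (the candrabindu case from this comment).
print(missing_aksharas("\u0938\u093e \u091b", "\u0938\u093e\u0901 \u091b\u0901"))
```

Running this over the full training and eval texts gives a list of candidates to add to the training text.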

Shreeshrii avatar Jun 14 '17 11:06 Shreeshrii

@Shreeshrii Thanks for your help yesterday. I encountered the same error (Encoding of string failed! Failure bytes: ffffffe0...) when training langdata/bod (Tibetan). It seems that most of the Unicode characters are mis-decoded. I tried replacing the top layer but still got the same error. Since I'm already using the latest langdata, is there anything I can do to correct the encoding? Could you help me? Thanks very much!

zc813 avatar Feb 02 '18 04:02 zc813

As per @theraysmith

  • There is an un-represented Indic grapheme/aksara in the text. In any case it will result in that training image being ignored by the trainer. If the error is infrequent, it is harmless, but it may indicate that your unicharset is inadequate for representing the language that you are training.

@zc813

tesstrain.sh has a limit of max_pages 3; you should change that so that the complete training_text is used.

You can review the training_text to check that it is a correct representation of bod (Tibetan).

Also test OCR with the 'Tibetan' script traineddata from both the 'tessdata_best' and 'tessdata_fast' repos.

An authoritative answer can only be provided by @theraysmith.

Shreeshrii avatar Feb 02 '18 05:02 Shreeshrii

@Shreeshrii Thanks a lot for the reply! I'll try the solution.

btw I tried to decode the error message and found most of them started with

ffffffe0 ffffffbc ffffff8c ffffffe0 ffffffbc ffffff8d

i.e. ༌། (0xf0c 0xf0d). Both 0xf0c and 0xf0d are already stored separately in my Tibetan.unicharset, so I am confused about why they cannot be encoded when they appear together.
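
For anyone else decoding these messages: the values in the dump look like sign-extended bytes (ffffffe0 is the byte e0), so keeping the low 8 bits of each value and decoding the result as UTF-8 recovers the text. A minimal sketch, with hypothetical helper names:

```python
def decode_failure_bytes(dump):
    """Turn the hex values of a 'Failure bytes' dump back into text.

    Only the low 8 bits of each value are kept, which undoes the apparent sign extension.
    """
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    # A dump cut off mid-character should not raise, so bad bytes are replaced.
    return raw.decode("utf-8", errors="replace")


def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


text = decode_failure_bytes("ffffffe0 ffffffbc ffffff8c ffffffe0 ffffffbc ffffff8d")
print(code_points(text))
```

The first code point of the decoded text is where encoding stopped, so it is the character or combination to look for in the unicharset.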

zc813 avatar Feb 02 '18 05:02 zc813

Same problem as I had mentioned in one of my earlier comments -

While each unicode character (स ा ँ ) is there in the Devanagari unicharset, the combined akshara (साँ, छँ) is not there.

No answer from @theraysmith yet. He has also marked this as a closed issue.

Shreeshrii avatar Feb 02 '18 05:02 Shreeshrii

@zdenop Ray closed this, so I cannot reopen it.

Please reopen this issue, because the problem is still there. It is related to utf-8/utf-16/utf-32 conversion.

Example:

Encoding of string failed! Failure bytes: cc 84 67 6e 65
Can't encode transcription: 'mamāgne' in language ''
utf8 6D 61 6D 61 CC 84 67 6E 65
utf16 006D 0061 006D 0061 0304 0067 006E 0065
hex 006D 0061 006D 0061 0304 0067 006E 0065

The error is related to 'CC 84' in UTF-8, which is '0304' (U+0304, COMBINING MACRON) in UTF-16 or hex.

The string was converted using the converter at https://r12a.github.io/app-conversion/
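
The same inspection can be done locally without the online converter. This sketch lists every combining mark in a transcription with its position and Unicode name; the helper name is hypothetical.

```python
import unicodedata


def combining_marks(text):
    """Return (index, code point, name) for every combining mark in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        # General categories Mn, Mc and Me are the combining marks.
        if unicodedata.category(ch).startswith("M")
    ]


# 'a' followed by U+0304, matching the UTF-8 bytes 61 CC 84 quoted in this comment.
print(combining_marks("mama\u0304gne"))
```

Each mark reported here has to be encodable after its base character, either as a separate unicharset entry or as part of a combined one.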

Shreeshrii avatar Jul 02 '18 20:07 Shreeshrii

https://stackoverflow.com/questions/42012563/convert-unicode-code-points-to-utf-8-and-utf-32

Shreeshrii avatar Jul 03 '18 04:07 Shreeshrii

https://github.com/tesseract-ocr/tesseract/blob/a80a8f17bb32be8bdd5124057219620b711491a7/src/lstm/lstmtrainer.cpp#L785

Shreeshrii avatar Jul 03 '18 04:07 Shreeshrii

@ivanzz1001 Any ideas?

Shreeshrii avatar Jul 03 '18 04:07 Shreeshrii

Can't encode transcription: 'ঢাকা মেটো-গ' in language ''
Encoding of string failed! Failure bytes: ffffffe0 ffffffa6 ffffffbe ffffffe0 ffffffa6 ffffff95 ffffffe0 ffffffa6 ffffffbe 20 ffffffe0 ffffffa6 ffffffae ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff9f ffffffe0 ffffffa7 ffffff87 ffffffe0 ffffffa6 ffffff97
Can't encode transcription: '|ঢাকা মেটেগ' in language ''
^Cmake: *** Deleting file 'data/checkpoints/banglaLPRNew_checkpoint'
Makefile:129: recipe for target 'data/checkpoints/banglaLPRNew_checkpoint' failed

xhuvom avatar Oct 19 '18 13:10 xhuvom

It looks like this was the first report of the encoding problem, so I re-open it until it is (hopefully soon) solved.

stweil avatar Oct 09 '19 17:10 stweil

@stweil After this initial error report, Ray changed the LSTM training process, so some of the comments no longer apply to the current code. Regardless, the issue is still there.

On Wed, Oct 9, 2019 at 11:29 PM Stefan Weil [email protected] wrote:

See also later errors with "Encoding of string failed" https://github.com/tesseract-ocr/tesseract/issues?utf8=%E2%9C%93&q=%22Encoding+of+string+failed%22 .


Shreeshrii avatar Oct 10 '19 11:10 Shreeshrii

I could fix the encoding errors for tesstrain by normalizing the ground truth texts, see https://github.com/tesseract-ocr/tesstrain/pull/111.

stweil avatar Oct 11 '19 16:10 stweil

@stweil If I understand the change correctly, this normalizes the ground-truth text within the box file so that errors are avoided during LSTM training.

So any comparison against the original ground-truth files using diff, wdiff, or evaluation tools may still show errors for the normalized characters.

Also, this does not address the case where training is done using training_text and fonts.

I would suggest adding a new script, normalize.py, which can be used to normalize any training text before beginning the training process, and also adding normalization to the wiki as part of the process of creating the training text.

Also, it may be helpful to normalize all existing training_text files in the langdata_lstm and langdata repos.
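
A rough sketch of what such a normalize.py could look like follows. It is hypothetical and not the code from the tesstrain pull request; the choice of NFC as the normalization form is an assumption, and the function names are made up for illustration.

```python
import unicodedata


def normalize_lines(lines, form="NFC"):
    """Normalize each line and report which lines changed.

    Returns (normalized_lines, changed_line_numbers), with line numbers starting at 1.
    """
    normalized, changed = [], []
    for number, line in enumerate(lines, start=1):
        result = unicodedata.normalize(form, line)
        if result != line:
            changed.append(number)
        normalized.append(result)
    return normalized, changed


def normalize_file(src, dst, form="NFC"):
    """Write a normalized copy of src to dst and return the changed line numbers."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    normalized, changed = normalize_lines(lines, form)
    with open(dst, "w", encoding="utf-8") as f:
        f.writelines(normalized)
    return changed


out, changed = normalize_lines(["mama\u0304gne\n", "abc\n"])
print(changed, [f"U+{ord(ch):04X}" for ch in out[0].strip()])
```

Writing to a separate file keeps the original text available for the diff and wdiff comparisons mentioned above. Note that NFC does not always give the shorter form: the precomposed Devanagari nukta letters such as U+095C are composition exclusions, so NFC turns U+095C into U+0921 U+093C. Whether the normalized text then encodes depends on the unicharset containing the resulting characters.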

Shreeshrii avatar Oct 12 '19 05:10 Shreeshrii