tesseract Can't encode transcription

Open sameearif88 opened this issue 3 years ago • 1 comments

Hello, I am trying to train from scratch (or fine-tune) tesseract for the "Jameel Noori Nastaleeq" font for Urdu. These are the steps I followed for training from scratch:

Create unicharset from all groundtruth files:

unicharset_extractor --output_unicharset file.unicharset --norm_mode 3 file

Create the starter traineddata using the above unicharset:

combine_lang_model --input_unicharset file.unicharset --script_dir "langdata/" --output_dir "output/" --lang JNUrd

Create a wordstrbox file for each image:

tesseract file1.png file1 --psm 6 wordstrbox

Manually correct the wordstrbox files using the ground truth.

Create an lstmf file from each png and its corresponding box file:

tesseract file.png file --psm 6 lstm.train

Create a list of the lstmf files to use for training:

ls *.lstmf -1 > mylang.trainingfiles_text

I am getting this error on the training step:

Encoding of string failed! Failure bytes: ffffffd9 ffffff8a ffffffd9 ffffff94 ffffffdb ffffff92 20 ffffffd9 ffffff88 ffffffd8 ffffffb2 ffffffdb ffffff8c ffffffd8 ffffffb1 20 ffffffd8 ffffffae ffffffd8 ffffffa7 ffffffd8 ffffffb1 ffffffd8 ffffffac ffffffdb ffffff81 20 ffffffd8 ffffffb4 ffffffd8 ffffffa7 ffffffdb ffffff81 20 ffffffd9 ffffff85 ffffffd8 ffffffad ffffffd9 ffffff85 ffffffd9 ffffff88 ffffffd8 ffffffaf 20 ffffffd9 ffffff82 ffffffd8 ffffffb1 ffffffdb ffffff8c ffffffd8 ffffffb4 ffffffdb ffffff8c 20 ffffffd9 ffffff86 ffffffdb ffffff92 20 ffffffd8 ffffffa8 ffffffd8 ffffffaa ffffffd8 ffffffa7 ffffffdb ffffff8c ffffffd8 ffffffa7 20 ffffffda ffffffa9 ffffffdb ffffff81 20 ffffffd9 ffffff85 ffffffd9 ffffff84 ffffffd8 ffffffa7 ffffffd9 ffffff82 ffffffd8 ffffffa7 ffffffd8 ffffffaa

Can't encode transcription: 'بعد نجی ٹی وی سے گفتگو کرتے ہوئے وزیر خارجہ شاہ محمود قریشی نے بتایا کہ ملاقات' in language ''
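
For reference, the failure bytes can be turned back into readable text. The sketch below assumes each token in the dump is one UTF-8 byte printed as a sign-extended integer (hence the leading ffffff). decode_failure_bytes is a hypothetical helper name, not part of Tesseract, and only the first few tokens of the dump above are used.

```python
def decode_failure_bytes(dump: str) -> str:
    """Turn a 'Failure bytes:' hex dump back into text (assumes UTF-8)."""
    # Each token is a byte shown as a sign-extended int, so keep the low 8 bits.
    raw = bytes(int(token, 16) & 0xFF for token in dump.split())
    return raw.decode("utf-8")


# First tokens of the dump from the error message above.
dump = "ffffffd9 ffffff8a ffffffd9 ffffff94 ffffffdb ffffff92 20 ffffffd9 ffffff88"
text = decode_failure_bytes(dump)

# List the code points so combining marks are visible individually.
print([f"U+{ord(ch):04X}" for ch in text])
```

On this prefix the first two code points are U+064A and U+0654, which is the decomposed form of U+0626 (ئ), so it may be worth checking whether those code points are present in file.unicharset.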

I have tried normalizing the text with the normalize.py file, and I have also tried fine-tuning for Urdu instead of training from scratch, but neither approach resolves the error.
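
One way to narrow this down is to compare the code points in a transcription against the unicharset. The following is a rough sketch, assuming the unicharset layout is a count on the first line followed by one entry per line whose first space-separated field is the character(s). The helper names and the shortened sample entries are made up for illustration. It is only a necessary check: every code point being covered does not guarantee that the whole string can be encoded.

```python
import unicodedata


def unicharset_codepoints(lines):
    """Collect every code point that occurs in a unicharset entry."""
    covered = set()
    for line in lines[1:]:  # assumption: the first line holds the entry count
        first = line.split(" ")[0]
        if not first or first == "NULL":  # assumption: NULL is a placeholder entry
            continue
        covered.update(first)
    return covered


def missing_codepoints(text, covered):
    """Return the distinct non-space characters of text that are not covered."""
    missing = []
    for ch in text:
        if ch.isspace() or ch in covered or ch in missing:
            continue
        missing.append(ch)
    return missing


# Shortened, made-up entries; a real file would be read with
# open("file.unicharset", encoding="utf-8").read().splitlines()
sample_unicharset = ["3", "NULL 0", "\u06d2 1", "\u0648 1"]
sample_text = "\u064a\u0654\u06d2 \u0648"

covered = unicharset_codepoints(sample_unicharset)
missing = missing_codepoints(sample_text, covered)
for ch in missing:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
```

Running the same check on the text after normalization would show whether normalize.py and unicharset_extractor agree on the form of each character.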

Aug 16 '21 17:08 sameearif88

This error was also reported in #1012.

Aug 18 '21 20:08 amitdo
