tesstrain

"Normalization failed" / "Invalid start of grapheme sequence" error while training a Tesseract model

Open Sanketnarkhede-10 opened this issue 1 year ago • 1 comment

Normalization failed for string 'ଜୀବନକୁ ନିବିଡ଼ ଭାବେ ଏକନ୍ୱିତ କରିଛନ୍ତି'
Invalid start of grapheme sequence:D=0xb71
Normalization failed for string 'ପରମ୍ପରାକୁ ଅବଲମ୍ୱନ କରିଛନ୍ତି, ସେତିକି ମଧ୍ୟ'
Invalid start of grapheme sequence:M=0xb48
Normalization failed for string 'ଦ୍ୱୈତ ରୂପରେ ଦେଖିଥିଲେ, ଏଠାରେ ପୁରୁଷ'
Invalid start of grapheme sequence:M=0xb47
Normalization failed for string 'ତାଙ୍କ ହୃଦୟ ବିଭୋର ହୋଇଛି ସମ୍ୱେଦନଶୀଳତାରେ;'
Invalid start of grapheme sequence:D=0xb71

I'm getting this error while training a Tesseract OCR model for the Oriya language. Please help me resolve this issue. I'm attaching the ground truth files.

Training on Tesseract 4.1.1: tesseract 4.1.1 leptonica-1.82.0

ocr_training.zip

Sanketnarkhede-10, Jun 07 '23 03:06

Try to shorten those strings in your training data until the error messages disappear, then check what was wrong with them.

And please use the latest Tesseract version 5.3.1 instead of 4.1.1.
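One way to support the "shorten and check" approach is to look at the code points around the character named in the error message. The following is only a rough sketch in Python using the standard `unicodedata` module; the helper names `describe` and `find_code_point` and the three-character sample are made up for illustration and are not part of tesstrain or Tesseract. It does not reproduce Tesseract's own validation rules, it only shows what surrounds the reported code point (for example 0xb71, which Unicode names ORIYA LETTER WA).

```python
import unicodedata


def describe(text):
    """Return (index, hex code point, Unicode name, category) per character."""
    return [
        (i, f"0x{ord(ch):x}", unicodedata.name(ch, "UNKNOWN"), unicodedata.category(ch))
        for i, ch in enumerate(text)
    ]


def find_code_point(text, code_point):
    """Return every index at which the given code point occurs in text."""
    return [i for i, ch in enumerate(text) if ord(ch) == code_point]


# Hypothetical sample: NA + VIRAMA + WA, written with escapes so the
# exact code points are unambiguous. Replace it with a failing line
# from your own ground truth files.
sample = "\u0b28\u0b4d\u0b71"

# 0x0B71 is the code point reported as "D=0xb71" in the log.
for i in find_code_point(sample, 0x0B71):
    # Show a small window around each occurrence.
    start = max(0, i - 3)
    for row in describe(sample[start:i + 2]):
        print(row)
```

Running this on each failing line, with the code point from its error message, should make it easier to see which character sequence to cut or compare when shortening the string.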

stweil, Jun 07 '23 04:06