Best Traineddata Feedback - Gujarati - ન - ત Confusion

Open Shreeshrii opened this issue 7 years ago • 2 comments

When using tesseract 4.0 with --oem 1 (LSTM) with Gujarati traineddata, ન is being recognized as ત in the attached image.

The same image recognized with --oem 0 (legacy engine) gets ન right, but has other accuracy problems.
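For anyone wanting to reproduce the comparison, here is a minimal sketch that runs the same image through both engines via the tesseract command line. The image filename is a placeholder (the attachment's real name is not shown here), and the helper names `tesseract_cmd` and `run_ocr` are made up for illustration. It assumes `tesseract` is on the PATH and that the installed guj traineddata contains both the legacy and the LSTM model; an LSTM-only file will fail with --oem 0.

```python
import subprocess


def tesseract_cmd(image, lang, oem, psm=6):
    """Build a tesseract CLI call that writes the recognized text to stdout."""
    return [
        "tesseract", image, "stdout",
        "-l", lang,
        "--oem", str(oem),   # 0 = legacy engine, 1 = LSTM engine
        "--psm", str(psm),   # 6 = assume a single uniform block of text
    ]


def run_ocr(image, lang, oem, psm=6):
    """Run tesseract and return its output decoded as UTF-8 text."""
    result = subprocess.run(
        tesseract_cmd(image, lang, oem, psm),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


# Example usage (placeholder filename, not the actual attachment name):
# legacy_text = run_ocr("sample-gujarati.png", "guj", oem=0)
# lstm_text = run_ocr("sample-gujarati.png", "guj", oem=1)
```

Keeping the page segmentation mode fixed for both runs means any difference in the output comes from the engine rather than from layout analysis.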

So it looks like the LSTM model for Gujarati has not been trained with this font.

Image and ground truth file are attached.
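To quantify the confusion rather than eyeball it, the OCR output can be aligned against the ground truth and the one-for-one character substitutions counted. The following is a rough sketch using Python's difflib, not an established OCR evaluation tool. The function name `char_confusions` is hypothetical, and it only counts replaced runs of equal length, so insertions, deletions and uneven replacements are ignored.

```python
from collections import Counter
from difflib import SequenceMatcher


def char_confusions(truth, ocr):
    """Count (expected, recognized) character pairs where OCR substituted one for the other."""
    counts = Counter()
    # autojunk=False keeps frequent characters from being treated as junk on long texts.
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements can be paired up character by character.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(truth[i1:i2], ocr[j1:j2]):
                counts[(expected, got)] += 1
    return counts


# Example usage with the ground truth file and an OCR result string:
# truth = open("guj.ag.exp0-GT.txt", encoding="utf-8").read()
# confusions = char_confusions(truth, ocr_text)
# print(confusions[("ન", "ત")])
```

Running this on the output of each engine or traineddata file gives a per-pair count that can be compared directly.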

It would be helpful to be able to fine-tune using real images in addition to synthetic data.

Attachments: guj.ag.exp0-GT.txt (ground truth), guj ag exp0 (image)

Shreeshrii avatar Jun 23 '17 14:06 Shreeshrii

Just tested with both best/guj and best/Gujarati traineddata, using --psm 6.

While the ન - ત Confusion is still there, Gujarati traineddata seems better than guj - it is dropping fewer words in OCR output.

Shreeshrii avatar Aug 04 '17 17:08 Shreeshrii

While the ન - ત Confusion is still there, Gujarati traineddata seems better than guj - it is dropping fewer words in OCR output.

https://github.com/tesseract-ocr/tesseract/pull/1264

amitdo avatar Oct 15 '18 10:10 amitdo