
Fine-tuning Arabic traineddata to solve extended words issue

Open sifdinNh opened this issue 1 year ago • 2 comments

So I want to fine-tune ara.traineddata from the tessdata_best repo to handle extended (elongated) words like this:

sample_9

To do that, I made a list of lines in the same format, like this:

.............
الســــــــيد العضـــــو د. عــــلي العتيبــــــي:
الســــــــيد العضـــــو جــــمال الحــــربي:
الســــــــيد العضـــــو د. خالــــد الفيصـــــل:
الســـــــــيد العضـــــو تركـــــي المطيــــري:
..............
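Lines like the ones above can also be produced synthetically. The following is a minimal sketch, assuming the "extended" words are ordinary Arabic words elongated with the tatweel character (U+0640), which is what the sample lines contain. The helper name `elongate`, the set of non-joining letters, and the run lengths are illustrative assumptions, not something taken from the issue or from tesstrain.

```python
import random

TATWEEL = "\u0640"  # Arabic tatweel (kashida), the elongation character

# Assumption: letters that do not join to the following letter,
# so a tatweel should not be placed directly after them.
NON_JOINING = set("اأإآدذرزوؤءةى")


def elongate(word, rng, max_len=6):
    """Insert one run of tatweel at a random joinable position in `word`.

    Returns the word unchanged if no position qualifies.
    """
    slots = [
        i
        for i in range(1, len(word))
        if word[i - 1].isalpha()
        and word[i].isalpha()
        and word[i - 1] != TATWEEL
        and word[i] != TATWEEL
        and word[i - 1] not in NON_JOINING
    ]
    if not slots:
        return word
    i = rng.choice(slots)
    return word[:i] + TATWEEL * rng.randint(2, max_len) + word[i:]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are reproducible
    for w in ["السيد", "العضو", "خالد"]:
        print(elongate(w, rng))
```

Mixing elongated and non-elongated variants of the same text in the training list may help keep the fine-tuned model from degrading on regular text, though that is a guess rather than something verified here.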

I started by generating ground truth files: .tif images and .box files.

Then I started training with this:

make training MODEL_NAME=ara_new TESSDATA=../tesseract/tessdata START_MODEL=ara MAX_ITERATIONS=10000 LANG_TYPE=RTL

Training started at 99% BCER, and I stopped when it reached 24% BCER.

When I tested the new traineddata file and compared it against the best ara.traineddata, I got a poor result.

This is the result of the best traineddata for Arabic: sample_5. It gives me almost 90% accuracy.

But when I tested the newly trained file, this is the result: sample_5

It's as if it doesn't recognize anything, and the main reason I started this was to fine-tune it for better accuracy.
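To put a number on a comparison like this, one option is to compute a character error rate (CER) for each model's output against the expected text. The sketch below uses a plain Levenshtein distance. It is an illustration only: the function names are made up here, and it is not claimed to match how the BCER figure in the training log is calculated.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,               # deletion
                    cur[j - 1] + 1,            # insertion
                    prev[j - 1] + (ca != cb),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical strings standing in for ground truth and OCR output.
    print(cer("abcd", "abxd"))
```

Running the same `cer` over the output of both the stock and the fine-tuned model on held-out lines (not the training lines) gives a like-for-like comparison.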

sifdinNh, Nov 28 '23 19:11

@zdenop

sifdinNh, Nov 29 '23 20:11

I'm not sure whether the issue arises because the model was trained on multi-line TIFF images, but have you tried fine-tuning with images that contain a single line of text each? Give it a try if you haven't yet, and share the results with us.

AhmadHakami, Jan 02 '24 20:01