tesstrain
Fine-tuning Arabic traineddata to solve extended words issue
I want to fine-tune ara.traineddata from the tessdata_best repo so that it handles extended (stretched) words.
To do that, I made a list of lines in that format, like this:
.............
الســــــــيد العضـــــو د. عــــلي العتيبــــــي:
الســــــــيد العضـــــو جــــمال الحــــربي:
الســــــــيد العضـــــو د. خالــــد الفيصـــــل:
الســـــــــيد العضـــــو تركـــــي المطيــــري:
..............
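For reference, stretching of this kind is normally written with the Arabic tatweel (kashida) character, U+0640, placed between letters. A minimal Python sketch, assuming that is what these lines contain, that counts and strips it so a stretched line can be compared with its plain form. The helper names are hypothetical and not part of tesstrain:

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida), the elongation character


def count_tatweel(line: str) -> int:
    """Return how many tatweel characters the line contains."""
    return line.count(TATWEEL)


def strip_tatweel(line: str) -> str:
    """Remove every tatweel so that only the base letters remain."""
    return line.replace(TATWEEL, "")
```

Running `strip_tatweel` over both the ground truth and the OCR output before comparing them shows whether the remaining errors come from the elongation itself or from the letters.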
I started by generating ground truth files: .tif images and .box files.
Then I started training with this:
make training MODEL_NAME=ara_new TESSDATA=../tesseract/tessdata START_MODEL=ara MAX_ITERATIONS=10000 LANG_TYPE=RTL
Training started at 99% BCER, and I stopped it when it reached 24% BCER.
When I tested the new traineddata file and compared it against the best ara.traineddata, I got a poor result.
This is the result of the best traineddata for Arabic:
It gives me almost 90% accuracy.
But when I tested the newly trained file, this is the result:
It seems not to recognize anything, and the main reason I started this was to fine-tune the model for better accuracy.
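To put a number on the comparison between the two models, one option is character accuracy based on edit distance. The sketch below is a plain Levenshtein implementation in Python. It is an assumption about how accuracy could be measured, not necessarily the metric behind the 90% figure or the BCER that training reports, and the function names are hypothetical:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, clamped at 0, with CER relative to the reference length."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)
```

Feeding the same ground truth line and each model's output to `char_accuracy` gives two directly comparable scores.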
@zdenop
Uncertain if the issue arises because the model was trained on multi-line text in the TIFF images, but have you attempted fine-tuning with one line of text per image? Give it a try if you haven't yet and share the results with us.
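A minimal sketch of the one-line-per-file preparation, assuming the usual tesstrain layout in which each line image (.tif or .png) sits in data/MODEL_NAME-ground-truth next to a single-line transcription with the same base name and the extension .gt.txt. The function name and the file prefix are hypothetical, and rendering or cropping the matching line images is left out:

```python
from pathlib import Path


def write_line_ground_truth(lines, out_dir, prefix="line"):
    """Write each non-empty line to its own .gt.txt file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for text in lines:
        text = text.strip()
        if not text:
            continue  # skip blank lines so every file holds exactly one text line
        path = out / f"{prefix}_{len(written):05d}.gt.txt"
        path.write_text(text + "\n", encoding="utf-8")
        written.append(path)
    return written
```

For the command above, the output directory would be data/ara_new-ground-truth, and each line_00000.gt.txt would need a line_00000.tif (or .png) showing only that one line.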