tess5train-fonts
Performance Degradation After Finetuning
Environment
Tesseract Version: v4.0.0.20181030
Platform: Ubuntu16
Motivation Introduction
I have some articles that contain both English and Greek letters and need to be OCRed, so I turned to Tesseract.
After installing Tesseract successfully, I opened a terminal and ran the command tesseract detector_sample_1.png result -l eng+grc, which produced result.txt.
The original image, "detector_sample_1.png", is shown below.
The resulting result.txt is shown below as well.
I found that Tesseract works quite well, apart from the content in the red block(s).
Greek letters do not actually appear very often in these articles, so I came up with the idea of retraining/finetuning the existing eng.traineddata.
Therefore, I turned to your code.
Description of My Experiment Process
After reading your README.md, I think I should first run 8-makedata_layernew.sh and then 9-layernew.sh (with some modifications, of course).
Because I need to finetune eng.traineddata with Greek letters, I prepared a training text, eng.anhao.training_text.txt. (I had to change the extension to .txt because I cannot upload a file with the extension .training_text.) The only thing I did in 8-makedata_layernew.sh was cat ../langdata/eng/eng.training_text ../langdata/eng/eng.anhao.training_text >../langdata/eng/eng.layer.training_text. In addition, I prepared a new test file, eng.layertest.training_text.txt.
Then I ran ./8-makedata_layernew.sh and 9-layernew.sh, and obtained eng_layer.traineddata.
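The concatenation step can be sketched as a small shell function. This is only a sketch: `combine_training_text` is a hypothetical helper name, not something taken from the repository scripts, and the guard against missing or empty inputs is an added assumption. Without such a guard, a mistyped relative path would make cat print an error but still leave a combined file that lacks one of the two texts.

```shell
# Hypothetical helper (not part of the repository scripts).
# Usage: combine_training_text BASE_TEXT EXTRA_TEXT OUTPUT
combine_training_text() {
    # Refuse to continue if either input is missing or empty.
    for f in "$1" "$2"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            return 1
        fi
    done
    # Base text first, then the extra (Greek) text, as in the cat line above.
    cat "$1" "$2" > "$3"
}

# Example call with the paths from this report:
# combine_training_text ../langdata/eng/eng.training_text \
#     ../langdata/eng/eng.anhao.training_text \
#     ../langdata/eng/eng.layer.training_text
```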
Experiment Result
Disappointingly, the performance degraded, although eng_layer.traineddata can recognize some Greek letters.
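One rough way to put a number on the degradation is to compare each model's output with a hand-corrected reference, line by line. The sketch below assumes such a reference transcription is available; `count_mismatched_lines` and the file name reference.txt are hypothetical, and an exact line comparison is much cruder than a character error rate.

```shell
# Hypothetical helper: print how many lines of HYP differ from REF.
# Lines are compared as exact strings; lines missing from HYP and extra
# non-empty lines in HYP both count as mismatches.
# Usage: count_mismatched_lines REF HYP
count_mismatched_lines() {
    awk '
        isref { ref[FNR] = $0; n = FNR; next }            # first file: store reference lines
        { m = FNR; if (($0 "") != (ref[FNR] "")) bad++ }  # second file: compare as strings
        END { if (n > m) bad += n - m; print bad + 0 }
    ' isref=1 "$1" isref=0 "$2"
}

# Example call (reference.txt is an assumed hand-corrected transcription):
# count_mismatched_lines reference.txt result.txt
```

Running it once on the output of eng+grc and once on the output of eng_layer, against the same reference, would show which model is further from the expected text.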
Conclusion
I tried to extend the existing model "eng.traineddata" with Greek letters using your code, but the result is disappointing. I hope you can help me.
@Shreeshrii I have also come across a similar issue. Could you please help address it?