Space corrupts the trained model
There is a weird problem that I have noticed while training:
If my training text includes a space at the beginning or at the end of a line, this causes:
- Longer training time.
- Lower recognition rate.
- Hindered convergence.
The more lines that begin or end with a space, the worse the symptoms become. It can even make the model hallucinate and see spaces everywhere, thus corrupting the model.
Solution:
- Make Tesseract automatically remove spaces from the beginning and the end of lines before generating the images.
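As a workaround on the user side, the training text can be cleaned before it is handed to the training tools. This is only a minimal sketch in Python, not anything Tesseract does itself: `strip_line_edges` is a hypothetical helper name, and dropping lines that end up empty is my own assumption about what is wanted.

```python
def strip_line_edges(text: str) -> str:
    """Strip leading/trailing spaces and tabs from every line of training text.

    Assumption: lines that become empty after stripping carry no training
    value, so they are dropped. Spaces inside a line are left untouched.
    """
    cleaned = []
    for line in text.splitlines():
        stripped = line.strip(" \t")  # only the edges of the line are touched
        if stripped:
            cleaned.append(stripped)
    # Keep a single trailing newline when there is any content at all.
    return "\n".join(cleaned) + ("\n" if cleaned else "")


if __name__ == "__main__":
    sample = "  first line \n\tsecond  line\t\n   \n"
    print(repr(strip_line_edges(sample)))
```

Running the function over the training text file (read and written as UTF-8) before generating images should give lines with no space at either end, while spaces between words stay as they are.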
Tab is used as the end-of-line marker in box files used for LSTM training.
On Wed, Jul 11, 2018 at 6:06 PM christophered wrote:
There is a weird problem that I have noticed while training: if my training text includes a space at the beginning or at the end of a line, this causes:
- Longer training time.
- Lower recognition rate.
- Hindered convergence.
The more lines that begin or end with a space, the worse the symptoms become. It can even make the model hallucinate and see spaces everywhere, thus corrupting the model. My guess is that tab has the same effect, or an even worse one, since it might be confused with space.
Solution:
- Make Tesseract automatically remove spaces from the beginning and the end of lines before generating the images.
- Add Tab to the forbidden characters; see the wiki for code points: https://en.wikipedia.org/wiki/Tab_key#Unicode
@Shreeshrii then just space
@stweil,
Should leading and trailing spaces be removed from the ground truth (GT) by the Tesseract training tools or by https://github.com/tesseract-ocr/tesstrain ?
https://github.com/tesseract-ocr/tesstrain/search?q=strip