
Space corrupts the trained model

Open ghost opened this issue 6 years ago • 4 comments

I have noticed a strange problem while training: if my training text includes a space at the beginning or end of a line, it causes:

  • Longer training time.
  • A lower recognition rate.
  • Hindered convergence.

The more lines that begin or end with a space, the worse the symptoms become. In bad cases the model hallucinates and sees spaces everywhere, which corrupts it.

Solution:

  • Make Tesseract automatically remove spaces from the beginning and end of each line before generating the images.
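
Until the tooling does this itself, the training text can be cleaned before the images are rendered. Below is a minimal sketch, assuming the training text is plain text with Unix line endings and one training line per text line. `strip_line_edges` is a hypothetical helper name, not a Tesseract function.

```python
def strip_line_edges(text: str) -> str:
    """Remove spaces and tabs from the start and end of every line.

    Interior spacing is left untouched and blank lines stay blank.
    Assumes "\n" line endings (an assumption, not a Tesseract requirement).
    """
    return "\n".join(line.strip(" \t") for line in text.split("\n"))


sample = "  leading space\ntrailing space  \n\tboth \n"
print(repr(strip_line_edges(sample)))
```

Only spaces and tabs are stripped here. Other Unicode whitespace would need to be added to the `strip` argument if it appears in the training text.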

ghost avatar Jul 11 '18 12:07 ghost

Tab is used as end of line for box files used for LSTM training.

On Wed, Jul 11, 2018 at 6:06 PM christophered wrote:

I have noticed a strange problem while training: if my training text includes a space at the beginning or end of a line, it causes:

  • Longer training time.
  • A lower recognition rate.
  • Hindered convergence.

The more lines that begin or end with a space, the worse the symptoms become. In bad cases the model hallucinates and sees spaces everywhere, which corrupts it. My guess is that a tab has the same effect, or an even worse one, since it might be confused with a space.

Solution:

  • Make Tesseract automatically remove spaces from the beginning and end of each line before generating the images.
  • Add Tab to the forbidden characters; see the wiki for code points: https://en.wikipedia.org/wiki/Tab_key#Unicode



Shreeshrii avatar Jul 11 '18 15:07 Shreeshrii

@Shreeshrii Then strip just the spaces.

ghost avatar Jul 11 '18 15:07 ghost

@stweil,

Should leading and trailing spaces be removed from the ground truth (GT) by the Tesseract training tools or by https://github.com/tesseract-ocr/tesstrain?
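
Wherever the fix ends up, existing ground truth can be audited for the problem first. Here is a rough sketch, assuming line-level ground truth is stored as UTF-8 `*.gt.txt` files. `find_padded_gt` is an illustrative name only, and the directory passed to it is whatever holds your ground truth.

```python
from pathlib import Path


def find_padded_gt(root: str) -> list[Path]:
    """Return ground-truth files whose text starts or ends with a space or tab."""
    padded = []
    for path in sorted(Path(root).rglob("*.gt.txt")):
        # Drop only the trailing line break; it is not part of the transcription.
        text = path.read_text(encoding="utf-8").rstrip("\r\n")
        if text != text.strip(" \t"):
            padded.append(path)
    return padded
```

Calling `find_padded_gt` on a ground-truth directory returns the affected files without modifying them, so the lines can be reviewed before anything is rewritten.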

amitdo avatar Nov 04 '22 06:11 amitdo

https://github.com/tesseract-ocr/tesstrain/search?q=strip

amitdo avatar Nov 04 '22 06:11 amitdo