tessdata_best Portuguese trained data fails to recognize @

Open jgsmarques opened this issue 4 years ago • 3 comments

Hi,

I'm using tesseract to perform OCR on a document that contains an email address. With the eng trained data, it recognizes the email address correctly (but naturally fails on all accented characters). When I switched to the por trained data, it picked up all Portuguese characters correctly, but failed to recognize the @ in email addresses. Is there any additional configuration needed for these special characters?

Thank you!

Apr 19 '20 14:04 jgsmarques

I have the same problem. I tried using multiple languages, but it doesn't work.

Oct 08 '21 18:10 LilianeAquino

#54 describes a similar problem for Turkish. It's likely that the training data for the two languages didn't include any examples of email addresses.

Oct 15 '23 22:10 tfmorris

That's correct. See por.unicharset and tur.unicharset, neither of which contains the @ character. That character was not part of the training data, so those models will never recognize it.
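To check this for another language, one option is to scan the model's unicharset for the symbol in question. The following is a minimal sketch, assuming the usual unicharset text layout (an entry count on the first line, then one entry per line beginning with the symbol itself) and assuming the unicharset has already been extracted to a file. The function names and the sample content are hypothetical.

```python
def unicharset_symbols(text):
    """Return the set of symbols listed in the text of a unicharset file."""
    symbols = set()
    # Assumption: line 1 is the entry count; every later line starts with
    # the symbol, followed by space-separated property fields.
    for line in text.splitlines()[1:]:
        symbol = line.split(" ", 1)[0]
        if symbol:
            symbols.add(symbol)
    return symbols


def has_symbol(text, symbol):
    """True if `symbol` appears as an entry in the unicharset text."""
    return symbol in unicharset_symbols(text)


def file_has_symbol(path, symbol):
    """Convenience wrapper that reads the unicharset from `path` (UTF-8)."""
    with open(path, encoding="utf-8") as handle:
        return has_symbol(handle.read(), symbol)
```

For example, `file_has_symbol("por.unicharset", "@")` would be expected to return `False` if the file matches the description above. The path is an assumption about where the extracted file lives.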

I suggest using -l Latin in similar cases.
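As a rough illustration of the -l Latin suggestion, here is a sketch that calls the tesseract command line from Python. It assumes the tesseract binary is on the PATH and that Latin.traineddata is installed where tesseract looks for models. Some installations keep script models in a subdirectory, in which case the name would be script/Latin. The helper names and the image path are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, lang="Latin", output_base="stdout"):
    """Build the argument list: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, lang="Latin"):
    """Run tesseract and return the recognized text from standard output."""
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract exits non-zero
    )
    return result.stdout
```

Calling `run_ocr("scan.png")` would then run the Latin script model over the image. Whether it handles both the accented characters and the @ for a given document still needs to be checked on real input.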

Oct 16 '23 05:10 stweil
