tesseract



Whitelist for Non-English Characters

Open YunTsen opened this issue 3 years ago • 5 comments

Environment

Tesseract Version: tesseract v5.0.0-alpha.20200328
Platform: Windows, 64-bit

Current Behavior:

While using chi_tra on this image, the result was "載", which was great. However, after specifying the whitelist with either of the following settings:

config='--oem 0 --psm 6 -c tessedit_char_whitelist=\u8f09' (\u8f09 is the Unicode escape for "載")

or

config='--oem 0 --psm 6 -c tessedit_char_whitelist=載'

the result was null.

It seems that the whitelist only accepts English characters or digits (the whitelist does work for numbers; I have tested that). How come?

P.S. I tried this because I wanted Tesseract to detect only the words on the whitelist.
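
To make the two forms of the config string above concrete, here is a minimal sketch, assuming the string is built in Python 3 and handed to a wrapper such as pytesseract (the issue does not name the wrapper, and `build_config` is a hypothetical helper, not part of any library). It shows that a normal Python 3 string literal already decodes `\u8f09` to "載", while a raw string (or text typed into a shell) keeps six separate ASCII characters, which Tesseract would presumably treat as a whitelist of those characters rather than of "載".

```python
def build_config(whitelist: str, oem: int = 0, psm: int = 6) -> str:
    """Hypothetical helper: assemble the config string used in this issue."""
    return f"--oem {oem} --psm {psm} -c tessedit_char_whitelist={whitelist}"


escaped = "\u8f09"   # normal literal: Python decodes the escape to one character
literal = "載"        # the character typed directly
raw = r"\u8f09"      # raw literal: backslash, 'u', '8', 'f', '0', '9'

if __name__ == "__main__":
    print(escaped == literal)            # the two normal literals are identical
    print(len(escaped), len(raw))        # one character versus six
    print(build_config(escaped) == build_config(literal))
    print(build_config(raw))             # the escape is passed through undecoded
```

If both normal literals produce the same config string, any difference in behavior has to come from how the string reaches Tesseract (for example, command-line encoding on Windows), not from the Python side.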

Expected Behavior:

Since chi_tra detects "載" accurately without the whitelist, it should also work when whitelist="載" is given.

Suggested Fix:

The variable tessedit_char_whitelist should accept non-English characters.

Jul 29 '20 09:07 YunTsen

Actually, I have tried this with Russian characters and it worked pretty well. So I am assuming that the problem is specific to some subset of UTF-8.

Sep 30 '20 14:09 Moldoteck

I ran into the same problem. Has it been solved?

Jan 11 '21 12:01 ChunkyZhang

Setting the system language to UTF-8 seems to work.

Jun 28 '21 09:06 focusexplorer
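
If the UTF-8 system setting is what makes the difference, the whitelist is probably being re-encoded on its way through the Windows command line. A possible way around that, sketched below and not verified against this issue, is to put the variable in a Tesseract config file written explicitly as UTF-8, so the character never passes through the command-line code page. The file name `whitelist.cfg` and the helper `write_whitelist_config` are hypothetical. The `name value` line format follows the config files shipped with Tesseract, such as `digits`.

```python
import tempfile
from pathlib import Path


def write_whitelist_config(chars: str, directory: str) -> Path:
    """Hypothetical helper: write a one-line Tesseract config file as UTF-8."""
    path = Path(directory) / "whitelist.cfg"
    line = f"tessedit_char_whitelist {chars}\n"
    # Encode explicitly so the bytes do not depend on the system locale.
    path.write_bytes(line.encode("utf-8"))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = write_whitelist_config("載", tmp)
        print(cfg.read_bytes())  # inspect the exact bytes Tesseract would read
```

The resulting path would then be given as the trailing config-file argument, for example `tesseract image.png stdout -l chi_tra --oem 0 --psm 6 whitelist.cfg`. Whether Tesseract resolves a config file outside its tessdata directory may depend on the version, so treat this as something to try rather than a confirmed fix.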

tessedit_char_whitelist=\u8f09

AFAIK, this usage is not supported.

Did you try:

tessedit_char_whitelist=載

?

Jun 12 '22 23:06 amitdo

Anyway, the allowlist / denylist feature is known not to work well with the LSTM engine.

Jun 12 '22 23:06 amitdo
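
Given that the allowlist is unreliable with the LSTM engine, one fallback that matches the original goal (keeping only the listed characters) is to run recognition without a whitelist and filter the text afterwards. The sketch below assumes the recognized text is already available as a Python string, and `filter_to_allowlist` is a hypothetical helper. This is not equivalent to a real whitelist: the engine is not constrained during recognition, so a character it misreads is simply dropped instead of being corrected.

```python
def filter_to_allowlist(text: str, allowed: str) -> str:
    """Hypothetical helper: keep only the characters that appear in `allowed`."""
    allowed_set = set(allowed)
    return "".join(ch for ch in text if ch in allowed_set)


if __name__ == "__main__":
    recognized = "運載火箭 123"  # stand-in for OCR output, not a real result
    print(filter_to_allowlist(recognized, "載"))
    print(filter_to_allowlist(recognized, "載0123456789"))
```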