Whitelist for Non-English Characters

Open YunTsen opened this issue 3 years ago • 5 comments

Environment

  • Tesseract Version: tesseract v5.0.0-alpha.20200328
  • Platform: Windows, 64-bit

Current Behavior:

While using chi_tra on this image, the result was "載", which was great. However, after specifying the whitelist with config='--oem 0 --psm 6 -c tessedit_char_whitelist=\u8f09' (\u8f09 is the Unicode escape for "載") or with config='--oem 0 --psm 6 -c tessedit_char_whitelist=載', the result was empty.

It seems that the whitelist only accepts English characters or digits (it does work for digits; I have tested that). Why is that?

P.S. I tried this because I wanted Tesseract to detect only the words on the whitelist.

Expected Behavior:

Since chi_tra detects "載" accurately without the whitelist, it should also work when whitelist="載" is given.

Suggested Fix:

The variable tessedit_char_whitelist should accept non-English characters.

YunTsen avatar Jul 29 '20 09:07 YunTsen

Actually, I have tried with Russian characters and it worked pretty well. So I am assuming that the problem is specific to some subset of UTF-8 characters.

Moldoteck avatar Sep 30 '20 14:09 Moldoteck

I have run into the same problem. Has it been solved?

ChunkyZhang avatar Jan 11 '21 12:01 ChunkyZhang

Setting the system language to UTF-8 seems to work.

focusexplorer avatar Jun 28 '21 09:06 focusexplorer
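One possible reason a UTF-8 system setting helps (an assumption, not something confirmed in this thread) is that a legacy single-byte code page cannot represent "載" at all, so the character may be lost before the option reaches Tesseract. The sketch below only checks which encodings can represent a character; the helper name `encodable` is hypothetical, and the choice of cp1252 and cp1251 as sample code pages is illustrative.

```python
import locale


def encodable(text: str, codec: str) -> bool:
    """Return True if every character of `text` can be encoded with `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


ch = "\u8f09"  # the character 載

# UTF-8 can represent any Unicode character; this one takes three bytes.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes)

# A Western European code page has no slot for this character.
print(encodable(ch, "cp1252"))

# A Cyrillic letter fits in the Cyrillic code page, which may be why
# Russian characters behave differently on some systems (speculative).
print(encodable("ж", "cp1251"))

# Encoding the current system would use by default for text.
print(locale.getpreferredencoding(False))
```

If the last line prints something other than a UTF-8 encoding, characters outside that code page are candidates for being dropped or mangled on the way to the command line.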

tessedit_char_whitelist=\u8f09

AFAIK, this usage is not supported.

Did you try:

tessedit_char_whitelist=載

?

amitdo avatar Jun 12 '22 23:06 amitdo
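A note for anyone building the config in Python: assuming the string in the original report was an ordinary Python 3 string literal (the issue does not say which wrapper was used), `\u8f09` is resolved by Python itself, so both forms send the same character to Tesseract. Only a raw string, or a shell command line, would pass the backslash sequence through unchanged. The helper `build_config` below is hypothetical and exists only to illustrate the comparison.

```python
def build_config(allowed: str, oem: int = 0, psm: int = 6) -> str:
    """Hypothetical helper: build a Tesseract option string with a whitelist."""
    return f"--oem {oem} --psm {psm} -c tessedit_char_whitelist={allowed}"


escaped = build_config("\u8f09")  # Python resolves the escape to one character
literal = build_config("載")      # the character typed directly
raw = build_config(r"\u8f09")     # raw string: backslash, 'u', four hex digits

print(escaped == literal)
print(len(raw) - len(literal))
```

Under that assumption, the two attempts in the report are the same request, which would explain why they failed in the same way.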

Anyway, the allowlist / denylist feature is known to not work well with the LSTM engine.

amitdo avatar Jun 12 '22 23:06 amitdo
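When the allowlist cannot be relied on, one workaround (not proposed in this thread, and only a sketch) is to run OCR without it and filter the recognized text afterward. The function name `filter_to_allowlist` and the choice to keep whitespace by default are assumptions for illustration. This is not equivalent to a real allowlist: it discards unwanted characters instead of steering the recognizer toward allowed ones.

```python
def filter_to_allowlist(text: str, allowed: str, keep_whitespace: bool = True) -> str:
    """Keep only characters that appear in `allowed` (hypothetical post-filter)."""
    allowed_set = set(allowed)
    return "".join(
        ch
        for ch in text
        if ch in allowed_set or (keep_whitespace and ch.isspace())
    )


# `ocr_text` stands in for whatever string the OCR call returned.
ocr_text = "載 abc 123 載"
filtered = filter_to_allowlist(ocr_text, "載")
print(filtered.split())
```

Splitting on whitespace afterward, as in the last line, gives the surviving characters grouped by their original position in the text.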