franc: Some Chinese sentences are detected as Japanese

Open kewang opened this issue 4 years ago • 3 comments

sentence 1

特別推薦的必訪店家「ヤマシロヤ」，雖然不在阿美橫町上，但就位於JR上野站廣小路口對面

jpn 1
Google Translate correctly detects this sentence as Chinese.

sentence 2

特別推薦的必訪店家，雖然不在阿美橫町上，但就位於JR上野站廣小路口對面

cmn 1
Google Translate correctly detects this sentence as Chinese.

Sentence 1 consists almost entirely of Chinese characters and contains only 5 Katakana characters, yet it is incorrectly detected as jpn.

Sentence 2 consists entirely of Chinese characters and is correctly detected as cmn.

This may be related to #77.

Apr 07 '20 14:04 kewang

Thanks. I don’t read, write, or speak Japanese or Chinese, so I can’t really help. PRs like GH-77 are welcome!

Apr 07 '20 19:04 wooorm

Hi @wooorm, @the-worldly-monkey

From https://www.unicode.org/faq/han_cjk.html#4 (How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?)

A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.

Based on that FAQ, I will add some extra rules to getTopScript(value, scripts) for detecting CJK sentences.

Apr 12 '20 12:04 kewang
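
For illustration, the "look at the text as a whole" idea from the Unicode FAQ could be sketched roughly as follows. This is not franc's code and not the rule proposed above: `guessCjkLanguage` and the 0.3 threshold are hypothetical, chosen only to show how the share of kana or hangul among CJK characters might decide between jpn, kor, and cmn.

```javascript
// Hypothetical helper, NOT part of franc: guess the language of CJK text
// from the share of kana and hangul among its CJK characters.
const HAN = /\p{Script=Han}/u;
const KANA = /[\p{Script=Hiragana}\p{Script=Katakana}]/u;
const HANGUL = /\p{Script=Hangul}/u;

// Assumed cut-off for "a fair amount" of kana or hangul; the FAQ gives no number.
const THRESHOLD = 0.3;

function guessCjkLanguage(value, threshold = THRESHOLD) {
  let han = 0;
  let kana = 0;
  let hangul = 0;

  // Iterate by code point so characters outside the BMP are handled correctly.
  for (const char of value) {
    if (HAN.test(char)) han++;
    else if (KANA.test(char)) kana++;
    else if (HANGUL.test(char)) hangul++;
  }

  const total = han + kana + hangul;
  if (total === 0) return 'und'; // no CJK characters at all
  if (hangul / total >= threshold) return 'kor';
  if (kana / total >= threshold) return 'jpn';
  return 'cmn';
}
```

With this sketch, sentence 1 from the report has 5 kana among 36 CJK characters, which is below the assumed threshold, so it would fall through to cmn. A real fix would have to weigh such a ratio against franc's existing trigram scoring, and the right threshold would need testing on real text.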

@kewang PR would be great on this!!

Jun 07 '20 22:06 niftylettuce
