franc
Some Chinese sentences are detected as Japanese
sentence 1
特別推薦的必訪店家「ヤマシロヤ」,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面
jpn 1
Google Translate correctly detects this as Chinese.
sentence 2
特別推薦的必訪店家,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面
cmn 1
Google Translate correctly detects this as Chinese.
Sentence 1 consists almost entirely of Chinese characters plus 5 katakana characters, but it is incorrectly detected as jpn.
Sentence 2 consists entirely of Chinese characters and is correctly detected as cmn.
This may be related to #77.
Thanks. I don’t read, write, or speak Japanese or Chinese, so I can’t really help. PRs like GH-77 are welcome!
Hi @wooorm, @the-worldly-monkey
From https://www.unicode.org/faq/han_cjk.html#4 (How can I recognize from the 32 bit value of a Unicode character if this is a Chinese, Korean or Japanese character?)
A better solution is to look at the text as a whole: if there's a fair amount of kana, it's probably Japanese, and if there's a fair amount of hangul, it's probably Korean.
Based on that FAQ, I will add some extra rules to getTopScript(value, scripts) for detecting CJK sentences.
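
A minimal sketch of the "fair amount of kana" rule from the FAQ, to make the idea concrete. This is not franc's code: `guessCjkLanguage` is a hypothetical standalone helper, and both 0.2 thresholds are illustrative assumptions rather than tuned values. How it would plug into getTopScript(value, scripts) is left open.

```javascript
// Hypothetical helper, NOT part of franc's API.
// The 0.2 thresholds are assumed for illustration and are not tuned.
function guessCjkLanguage(value, kanaThreshold = 0.2, hangulThreshold = 0.2) {
  let han = 0;
  let kana = 0;
  let hangul = 0;

  // Iterate by code point so characters outside the BMP are counted once.
  for (const char of value) {
    if (/\p{Script=Han}/u.test(char)) {
      han++;
    } else if (/[\p{Script=Hiragana}\p{Script=Katakana}]/u.test(char)) {
      kana++;
    } else if (/\p{Script=Hangul}/u.test(char)) {
      hangul++;
    }
  }

  const total = han + kana + hangul;
  if (total === 0) {
    return null; // No CJK characters, nothing to decide.
  }

  // Look at the text as a whole: a fair share of kana suggests Japanese,
  // a fair share of hangul suggests Korean, otherwise assume Chinese.
  if (kana / total >= kanaThreshold) {
    return 'jpn';
  }
  if (hangul / total >= hangulThreshold) {
    return 'kor';
  }
  return 'cmn';
}

const sentence1 =
  '特別推薦的必訪店家「ヤマシロヤ」,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面';
const sentence2 =
  '特別推薦的必訪店家,雖然不在阿美橫町上,但就位於JR上野站廣小路口對面';

console.log(guessCjkLanguage(sentence1));
console.log(guessCjkLanguage(sentence2));
```

In sentence 1, the 5 katakana make up 5 of 36 CJK characters (about 14%), so with the assumed 0.2 threshold it is treated as Chinese, the same as sentence 2. One known limitation of a ratio like this: Japanese text written only in kanji would also fall through to cmn.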
@kewang A PR for this would be great!!