language-detector Japanese text being identified as Kurmanji

It's probably because of this line. https://github.com/DanielJDufour/language-detector/blob/e960b59f53a41e26d44201d54410fd62299f9b8c/language_detector/prep/char_language.txt#L73

See the following texts:

୨୧譲渡交換୨୧ ツイステ色紙コレクション vol.1 vol.2 譲┊︎デューストレイケイトジャミルオルトシルバー求┊︎同異種リドル or 定価(＋送料) 郵送 or 都内手渡し可能 ⿻ 各1BOX予約済みです。 ⿻…

東映HP更新✨ 来週はガルザとクランチュラがジャメンタルを研究🔍録りおろしナレーションたっぷりでお届けします！そしてHPで #キラトーーク延長戦！？魔進の声を演じるキャストのテンションMAX！なコメントを掲載しております✨ #キラ…

「DXヒューマギアプログライズキーセット」はご予約受付中！シェスタ、腹筋崩壊太郎、マモル、一貫ニギローのデータを宿したプログライズキーのセットです✨ 別売りのDXなりきりシリーズとも連動します。 URL…

Jun 07 '20 02:06 ftkurt

Hi, @ftkurt. Thank you for identifying this issue! This package doesn't support Japanese yet, but it's easy to add a language. Would you like to submit a pull request? The documentation on how to add a language is here: https://github.com/DanielJDufour/language-detector/blob/master/CONTRIBUTING.md

Jun 07 '20 02:06 DanielJDufour

I briefly looked at the Japanese character sets, and Japanese seems a bit different from other languages in that it uses multiple sets. Therefore, I would prefer that someone knowledgeable about Japanese do that. However, I am currently working on collecting Sorani and Kurmanji datasets. I might be able to add more data for those two Kurdish dialects in the coming days. I think this will help make this package more reliable.
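
For reference, the "multiple sets" are hiragana, katakana, and kanji (CJK ideographs). Below is a rough sketch, not part of language-detector, of counting characters per script by Unicode block. The names `script_counts` and `looks_japanese` are hypothetical, and the ranges cover only the basic blocks, so treat it as an illustration rather than a complete check.

```python
# Hypothetical helpers for illustration; not language-detector's API.
# Basic Unicode blocks only (extensions and half-width forms are ignored).
BLOCKS = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "kanji": (0x4E00, 0x9FFF),  # CJK Unified Ideographs, shared with Chinese
}


def script_counts(text):
    """Count how many characters of text fall in each block."""
    counts = {name: 0 for name in BLOCKS}
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return counts


def looks_japanese(text):
    """Assume text is Japanese if it contains any kana.

    Kanji alone is not enough, since the same ideographs appear in Chinese.
    """
    counts = script_counts(text)
    return counts["hiragana"] + counts["katakana"] > 0
```

With this sketch, `looks_japanese("はご予約受付中")` is true because of the leading hiragana, while a kanji-only string stays ambiguous and would still need the package's own per-language data.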

Jun 07 '20 14:06 ftkurt

That would be great! Thank you!

Jun 07 '20 16:06 DanielJDufour
