Japanese text being identified as Kurmanji

Open ftkurt opened this issue 5 years ago • 3 comments

It's probably because of this line. https://github.com/DanielJDufour/language-detector/blob/e960b59f53a41e26d44201d54410fd62299f9b8c/language_detector/prep/char_language.txt#L73

See the following texts:

୨୧譲渡交換୨୧ ツイステ 色紙コレクション vol.1 vol.2 譲┊︎デューストレイケイト ジャミルオルトシルバー 求┊︎同異種リドル or 定価(+送料) 郵送 or 都内手渡し可能 ⿻ 各1BOX予約済みです。 ⿻…

東映HP更新✨ 来週はガルザとクランチュラがジャメンタルを研究🔍録りおろしナレーションたっぷりでお届けします! そしてHPで #キラトーーク 延長戦!? 魔進の声を演じるキャストのテンションMAX!なコメントを掲載しております✨ #キラ…

「DXヒューマギアプログライズキーセット」はご予約受付中!シェスタ、腹筋崩壊太郎、マモル、一貫ニギローのデータを宿したプログライズキーのセットです✨ 別売りのDXなりきりシリーズとも連動します。 URL…

ftkurt, Jun 07 '20 02:06

Hi, @ftkurt. Thank you for identifying this issue! This package doesn't support Japanese yet, but it's easy to add a language. Would you like to submit a pull request? The documentation on how to add a language is here: https://github.com/DanielJDufour/language-detector/blob/master/CONTRIBUTING.md

DanielJDufour, Jun 07 '20 02:06

I briefly looked at Japanese character sets, and Japanese seems a bit different from other languages because it uses multiple sets. I would therefore prefer that someone knowledgeable about Japanese do that. However, I am currently collecting Sorani and Kurmanji datasets, and I might be able to add more data for those two Kurdish dialects in the coming days. I think this will help make this package more reliable.
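
To illustrate the "multiple sets" point: Japanese text usually mixes hiragana, katakana, and kanji, and each lives in its own Unicode block. The snippet below is only a rough sketch of how one might count characters per script using the standard library. It is not part of language-detector, the function names (`script_of`, `count_japanese_scripts`, `looks_japanese`) are hypothetical, and the ranges cover only the three basic blocks (no half-width katakana or CJK extensions).

```python
from collections import Counter

# Basic Unicode blocks only; this is an assumption made for illustration.
HIRAGANA = (0x3040, 0x309F)
KATAKANA = (0x30A0, 0x30FF)
KANJI = (0x4E00, 0x9FFF)  # CJK Unified Ideographs, shared with Chinese


def script_of(ch):
    """Return the Japanese script name for a character, or None."""
    cp = ord(ch)
    if HIRAGANA[0] <= cp <= HIRAGANA[1]:
        return "hiragana"
    if KATAKANA[0] <= cp <= KATAKANA[1]:
        return "katakana"
    if KANJI[0] <= cp <= KANJI[1]:
        return "kanji"
    return None


def count_japanese_scripts(text):
    """Count how many characters of each script appear in text."""
    counts = Counter()
    for ch in text:
        name = script_of(ch)
        if name is not None:
            counts[name] += 1
    return counts


def looks_japanese(text):
    """Heuristic: kana does not occur in Chinese, so any kana hints at Japanese."""
    counts = count_japanese_scripts(text)
    return counts["hiragana"] + counts["katakana"] > 0


if __name__ == "__main__":
    print(count_japanese_scripts("各1BOX予約済みです。"))
    print(looks_japanese("ツイステ"))
```

Kanji alone is not enough to tell Japanese from Chinese, which is why this sketch keys on kana. A short string with only kanji and Latin letters would still be missed by this heuristic.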

ftkurt, Jun 07 '20 14:06

That would be great! Thank you!

DanielJDufour, Jun 07 '20 16:06