language-detector
Japanese text being identified as Kurmanji
It's probably because of this line: https://github.com/DanielJDufour/language-detector/blob/e960b59f53a41e26d44201d54410fd62299f9b8c/language_detector/prep/char_language.txt#L73
See the following texts:
୨୧譲渡交換୨୧ ツイステ 色紙コレクション vol.1 vol.2 譲┊︎デューストレイケイト ジャミルオルトシルバー 求┊︎同異種リドル or 定価(+送料) 郵送 or 都内手渡し可能 ⿻ 各1BOX予約済みです。 ⿻…
東映HP更新✨ 来週はガルザとクランチュラがジャメンタルを研究🔍録りおろしナレーションたっぷりでお届けします! そしてHPで #キラトーーク 延長戦!? 魔進の声を演じるキャストのテンションMAX!なコメントを掲載しております✨ #キラ…
「DXヒューマギアプログライズキーセット」はご予約受付中!シェスタ、腹筋崩壊太郎、マモル、一貫ニギローのデータを宿したプログライズキーのセットです✨ 別売りのDXなりきりシリーズとも連動します。 URL…
Hi, @ftkurt. Thank you for identifying this issue! This package doesn't support Japanese yet, but it's easy to add a language. Would you like to submit a pull request? The documentation on how to add a language is here: https://github.com/DanielJDufour/language-detector/blob/master/CONTRIBUTING.md
I briefly looked at the Japanese character sets, and Japanese seems a bit different from other languages because it uses multiple character sets. I would therefore prefer that someone knowledgeable about Japanese add it. However, I am currently collecting Sorani and Kurmanji datasets, and I might be able to add more data for those two Kurdish dialects in the coming days. I think this will help make the package more reliable.
That would be great! Thank you!
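For anyone picking up the Japanese side, here is a rough sketch of what "multiple character sets" means in practice. It is not part of language-detector and does not use its API; the names `JAPANESE_SCRIPT_RANGES`, `count_japanese_scripts`, and `looks_japanese` are made up for illustration. It only assumes the standard Unicode block ranges for hiragana, katakana, and CJK ideographs, and it treats the presence of any kana as a hint for Japanese, since kanji alone are shared with Chinese.

```python
from collections import Counter

# Unicode block ranges (inclusive). The dictionary and function names are
# hypothetical and exist only for this sketch.
JAPANESE_SCRIPT_RANGES = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "kanji": (0x4E00, 0x9FFF),  # CJK Unified Ideographs, shared with Chinese
}


def count_japanese_scripts(text):
    """Count how many characters of text fall into each script block."""
    counts = Counter()
    for char in text:
        code_point = ord(char)
        for script, (start, end) in JAPANESE_SCRIPT_RANGES.items():
            if start <= code_point <= end:
                counts[script] += 1
                break
    return counts


def looks_japanese(text):
    """Heuristic assumption: any hiragana or katakana suggests Japanese."""
    counts = count_japanese_scripts(text)
    return counts["hiragana"] + counts["katakana"] > 0


if __name__ == "__main__":
    sample = "別売りのDXなりきりシリーズとも連動します。"
    print(count_japanese_scripts(sample))
    print(looks_japanese(sample))
```

A check like this could run before the character-to-language lookup, so that text containing kana is not matched against other languages. Whether that fits the package's design is for the maintainer to decide.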