SudachiDict

Contains many hangeul terms in notcore_lex.csv

Open hanya opened this issue 2 years ago • 1 comment

Some Hangeul terms can be found in the notcore_lex.csv file, such as the following:

전범국,4785,4785,22000,전범국,名詞,固有名詞,一般,*,*,*,センパンコク,戦犯国,*,A,*,*,*,*
전지충이,4785,4785,22000,전지충이,名詞,固有名詞,一般,*,*,*,チョンジチュンイ,デンヂムシ,*,A,*,*,*,*
전툴라,4785,4785,22000,전툴라,名詞,固有名詞,一般,*,*,*,チョントゥラ,チョントゥラ,*,A,*,*,*,*

Are they included intentionally?
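
For anyone who wants to list such entries, here is a minimal sketch, not an official SudachiDict tool. It assumes the surface form is the first CSV column, as in the rows quoted above, and detects Hangeul by Unicode range. The function name `find_hangul_entries` and the 東京 row in the sample are hypothetical and only there for illustration.

```python
import csv
import io
import re

# Hangul Syllables, Hangul Jamo, and Hangul Compatibility Jamo ranges.
HANGUL = re.compile(r"[\uac00-\ud7a3\u1100-\u11ff\u3130-\u318f]")


def find_hangul_entries(lines):
    """Yield CSV rows whose first column (assumed surface form) contains Hangul."""
    for row in csv.reader(lines):
        if row and HANGUL.search(row[0]):
            yield row


# Sample input: the first row is quoted from this issue,
# the second is a made-up non-Hangul row for contrast.
sample = io.StringIO(
    "전범국,4785,4785,22000,전범국,名詞,固有名詞,一般,*,*,*,センパンコク,戦犯国,*,A,*,*,*,*\n"
    "東京,4785,4785,22000,東京,名詞,固有名詞,一般,*,*,*,トウキョウ,東京,*,A,*,*,*,*\n"
)

hits = list(find_hangul_entries(sample))
for row in hits:
    # Column 11 appears to hold the katakana reading in the rows above.
    print(row[0], row[11])
```

To run it against the real file, something like `open("notcore_lex.csv", encoding="utf-8", newline="")` could be passed in place of `sample`. The UTF-8 encoding is an assumption here.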

hanya, Sep 30 '21 13:09

Thank you for your inquiry.

Three types of words are registered in the Sudachi dictionary:

・words from UniDic
・words from NEologd
・words we collected

The Hangeul terms came from NEologd. So far we have not particularly scrutinized the UniDic and NEologd words. Looking at the registered Hangeul terms, most of them are Pokemon names. Since Hangeul terms are written in katakana in Japanese sentences, we are considering removing them.

sakamoto-mi, Oct 01 '21 04:10