Fix unk.def in mecab-ipadic
GitHub isn't showing the file properly, so to be clear, I changed this line:
SYMBOL,1283,1283,17585,名詞,サ変接続,*,*,*,*,*
to this:
SYMBOL,1283,1283,17585,記号,一般,*,*,*,*,*
The previous setting tags symbols as verbal nouns (名詞,サ変接続), which makes no sense and has confused many people, as the posts below show. I guess it was a mistake?
- Fixing half-width symbols being tagged as 名詞,サ変接続 in mecab : nymemo
- MeCab ends up recognizing symbols as "サ変接続" - BlankTar
- Mecab - Symbols end up as サ変接続 nouns (986) | teratail
- Notes from playing with MeCab's unknown words (unk.def) : mwSoft blog
The jumandic unk.def did not seem to have this problem.
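
For anyone who wants to try the same change on a local copy, here is a minimal sketch in Python. The function name `patch_symbol_entries` and the sample text are my own illustration and not part of MeCab. It assumes each unk.def line is comma-separated, with the category name first and the part-of-speech features starting at the fifth column, as in the line quoted above.

```python
OLD_POS = ["名詞", "サ変接続"]  # the tagging this PR removes
NEW_POS = ["記号", "一般"]      # the tagging this PR proposes


def patch_symbol_entries(unk_def_text: str) -> str:
    """Retag SYMBOL entries from 名詞,サ変接続 to 記号,一般.

    Hypothetical helper: lines for other categories pass through unchanged.
    """
    patched = []
    for line in unk_def_text.splitlines(keepends=True):
        fields = line.split(",")
        # Assumed column layout: category, left id, right id, cost, POS features...
        if len(fields) >= 6 and fields[0] == "SYMBOL" and fields[4:6] == OLD_POS:
            fields[4:6] = NEW_POS
        patched.append(",".join(fields))
    return "".join(patched)


if __name__ == "__main__":
    # The first line is the one changed in this PR; the second is a made-up
    # entry that only shows non-SYMBOL categories are left alone.
    sample = (
        "SYMBOL,1283,1283,17585,名詞,サ変接続,*,*,*,*,*\n"
        "OTHER,0,0,0,名詞,サ変接続,*,*,*,*,*\n"
    )
    print(patch_symbol_entries(sample), end="")
```

On a real dictionary the file would have to be read and written in the dictionary's own encoding, and the dictionary recompiled afterwards. Neither step is shown here.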
If there's anything I should improve, please let me know.
Many thanks for providing MeCab.
Hello. This PR has been open for over a year, and it would be great to have it addressed one way or another.
I should add that I have since realized why the current setting is in place. A footnote in "Applying Conditional Random Fields to Japanese Morphological Analysis" explains it:
> JUMAN assigns “unknown POS” to the words not seen in the lexicon. We simply replace the POS of these words with the default POS, Noun-SAHEN.
While that sounds reasonable, the articles I linked above and the issue that has been linked to this PR since I originally posted it show that this setting causes confusion, so I still think it should be changed.
Any feedback at all would be appreciated.