
Fix unk.def in mecab-ipadic

Open: polm opened this issue 6 years ago • 1 comment

GitHub isn't showing the file properly, so to be clear, I changed this line:

SYMBOL,1283,1283,17585,名詞,サ変接続,*,*,*,*,*

to this:

SYMBOL,1283,1283,17585,記号,一般,*,*,*,*,*

The previous setting makes no sense and has confused many people. I guess it was a mistake?
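For anyone who wants to see exactly which fields change, here is a minimal Python sketch. It assumes the usual MeCab dictionary CSV layout for unk.def (character category, left context ID, right context ID, cost, then the feature columns). The helper names `parse_unk_line`, `format_unk_line` and `retag_symbol` are made up for this illustration and are not part of MeCab.

```python
from typing import List, NamedTuple, Sequence


class UnkEntry(NamedTuple):
    """One unk.def row, assuming: category, left ID, right ID, cost, features."""
    category: str
    left_id: int
    right_id: int
    cost: int
    features: List[str]


def parse_unk_line(line: str) -> UnkEntry:
    # Split on commas; the first four columns are fixed, the rest are features.
    fields = line.strip().split(",")
    return UnkEntry(fields[0], int(fields[1]), int(fields[2]),
                    int(fields[3]), fields[4:])


def format_unk_line(entry: UnkEntry) -> str:
    # Inverse of parse_unk_line: join everything back into one CSV row.
    head = [entry.category, str(entry.left_id),
            str(entry.right_id), str(entry.cost)]
    return ",".join(head + entry.features)


def retag_symbol(line: str, pos: Sequence[str] = ("記号", "一般")) -> str:
    """Replace the leading POS columns of a SYMBOL row; leave other rows alone."""
    entry = parse_unk_line(line)
    if entry.category != "SYMBOL":
        return line
    # Overwrite only the first len(pos) feature columns; IDs and cost are kept.
    features = list(pos) + entry.features[len(pos):]
    return format_unk_line(entry._replace(features=features))


old = "SYMBOL,1283,1283,17585,名詞,サ変接続,*,*,*,*,*"
new = retag_symbol(old)
print(new)
```

Only the two POS columns differ between `old` and `new`; the context IDs and cost are untouched, which matches the change in this PR. Note that editing unk.def by itself is not enough: the dictionary has to be recompiled before MeCab picks up the change.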

The jumandic unk.def did not seem to have this problem.

If there's anything I should improve, please let me know.

Many thanks for providing MeCab.

polm, Jul 18 '17 15:07

Hello. This PR has been here for over a year, and it would be great to have it addressed one way or another.

I will add that I realized why the current setting is in place. There's a footnote in "Applying Conditional Random Fields to Japanese Morphological Analysis" that explains it:

JUMAN assigns “unknown POS” to the words not seen in the lexicon. We simply replace the POS of these words with the default POS, Noun-SAHEN.

While that sounds reasonable, the articles I linked above and the issue that has been linked to this PR since I originally posted it show that this setting causes confusion, and I still think it should be changed.

Any feedback at all would be appreciated.

polm, Mar 16 '19 10:03