
For Chinese, we should be able to load user dictionary using Jieba

aash949 opened this issue 2 years ago · 2 comments

Jieba has a function to load a user's dictionary to make word segmentation more accurate to your dictionary of choice, i.e. the cc-cedict dictionary. Here's the function...

```python
jieba.load_userdict(file_name)
```

I am proposing that, when Jieba is initialized, we check whether there is a userdict.txt file in dbs (like frequency.txt) and, if there is, use this function to load its contents before performing any word segmentation.
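A minimal sketch of that check, assuming the segmenter's loader is passed in as a callable (with Jieba it would be `jieba.load_userdict`; the function name `maybe_load_userdict` and the `dbs_dir` parameter are hypothetical, not MorphMan's actual code):

```python
import os

def maybe_load_userdict(dbs_dir, loader):
    """If a userdict.txt exists in the dbs directory, pass its path to the
    loader (e.g. jieba.load_userdict) before any segmentation is done.
    Returns True if a user dictionary was found and loaded."""
    path = os.path.join(dbs_dir, "userdict.txt")
    if os.path.isfile(path):
        loader(path)
        return True
    return False
```

Called once at initialization, this keeps the behavior unchanged when no userdict.txt is present.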

I haven't written much code since university, but I'll check to see if I can implement this change myself.

aash949 avatar Jul 04 '22 23:07 aash949

Is there any news on this?

yaoberh avatar Oct 04 '22 17:10 yaoberh

> Is there any news on this?

Implementing this and achieving the desired result (or at least my desired result) could be more complicated than I first thought.

If you load a user dictionary using Jieba before performing word segmentation, it will improve the word segmentation relative to your dictionary which is nice.

However, Jieba will continue to segment words the way it thinks words should be segmented rather than according to your dictionary.

What I often find is that Jieba treats two words that have separate entries in your dictionary as one longer compound word, but that longer word isn't in your dictionary, which can be a bit annoying if you would rather learn the two words and their individual meanings separately.

I think the best way to resolve this is to load your dictionary, perform word segmentation, and then check word by word whether each word is in your dictionary. If a word is not, use Jieba's del_word(word) function to remove it from Jieba's vocabulary (it is probably two words combined, with no entry of its own in your dictionary) and then segment again to see whether the two component words are now segmented separately, each with a dictionary entry available.
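The loop described above could be sketched roughly as follows. To keep the sketch self-contained, the segmenter and its delete function are passed in as callables (with Jieba these would be `jieba.lcut` and `jieba.del_word`), and the user's dictionary is a set of headwords; the function name and signature are hypothetical:

```python
def segment_against_dict(text, segment, del_word, dictionary):
    """Segment text, then delete any multi-character token that is not in
    the user's dictionary from the segmenter's vocabulary and re-segment,
    repeating until every multi-character token is a known dictionary word
    or no further deletions are possible."""
    deleted = set()
    while True:
        tokens = segment(text)
        unknown = [t for t in tokens
                   if len(t) > 1 and t not in dictionary and t not in deleted]
        if not unknown:
            return tokens
        for w in unknown:
            del_word(w)       # with Jieba: jieba.del_word(w)
            deleted.add(w)    # guard against looping on undeletable tokens
```

Since this can re-segment the same text several times, it would indeed be slower than a single pass, as noted below.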

I think this would slow things down a lot though.

Perhaps I'm overthinking this.

aash949 avatar Oct 05 '22 22:10 aash949