jieba
'是因为' doesn't cut as expected
What's the best way to get jieba to cut '是因为' into '是' and '因为'?
I was processing 影子的出现是因为有光 to tag the sentence for rare words, and the sentence scored as much rarer than expected because 是因为 came out as a single token.
cut_for_search on 是因为 gives ['因为', '是因为'], so the full input comes back alongside a shorter word contained in it. How often do the jieba cut functions do that, and is it by design? It was a little surprising, but maybe that's how the function is meant to work for search engines; I'm not sure.
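For reference, here is a rough pure-Python sketch of how I understand search mode to expand a single already-cut word: it looks for 2-character and 3-character pieces that are in the dictionary, emits those, and then emits the full word. The function name `search_mode_tokens` and the toy vocabulary are made up for illustration and are not part of jieba.

```python
def search_mode_tokens(word, vocab):
    """Approximate how search mode expands one word that was already cut.

    `vocab` stands in for jieba's dictionary (an assumption for this sketch).
    """
    tokens = []
    for n in (2, 3):
        # Only look for n-character pieces in words longer than n characters.
        if len(word) > n:
            for i in range(len(word) - n + 1):
                piece = word[i:i + n]
                if piece in vocab:
                    tokens.append(piece)
    # The full word is always emitted last.
    tokens.append(word)
    return tokens


toy_vocab = {"是", "因为", "是因为"}
print(search_mode_tokens("是因为", toy_vocab))
```

If that reading is right, the longer token showing up next to a shorter one would be intended behavior for search indexing rather than a duplication bug.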
Setting HMM=False still gives ['影子', '的', '出现', '是因为', '有', '光']
Unsure if this is a bug or by design. Is the right approach here to use a custom user dictionary limited to the top 20k words or so?
Apologies if this is pure user error, I am new to jieba and still trying to figure out all the features. Thanks for any recommendations.