
'是因为' doesn't cut as expected

Open brownbat opened this issue 1 year ago • 2 comments

What's the best way to get jieba to cut '是因为' into '是' and '因为'?

I was processing 影子的出现是因为有光 to tag the sentence for rare words and it scored much rarer than expected because of the 是因为 token.

Running cut_for_search on 是因为 gives ['因为', '是因为']. How often do the jieba cut functions repeat the whole input alongside a sub-word like that? Is that by design? It was a little surprising, but maybe that is how the function is meant to work for search engines; I'm not sure.
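To make the search-mode question concrete, here is a minimal sketch of how I understand that mode to behave: each word from the normal cut is scanned for in-dictionary 2- and 3-character substrings, which are emitted before the word itself. The function name `cut_for_search_sketch` and the tiny `TOY_VOCAB` set are my own illustration, not jieba's code or data.

```python
def cut_for_search_sketch(words, vocab):
    """Emit in-vocabulary 2- and 3-character substrings of each word, then the word."""
    for w in words:
        if len(w) > 2:
            # Slide a 2-character window over words longer than two characters.
            for i in range(len(w) - 1):
                gram = w[i:i + 2]
                if gram in vocab:
                    yield gram
        if len(w) > 3:
            # Slide a 3-character window over words longer than three characters.
            for i in range(len(w) - 2):
                gram = w[i:i + 3]
                if gram in vocab:
                    yield gram
        # The full word is always emitted last.
        yield w


# Hypothetical vocabulary, just large enough for this example.
TOY_VOCAB = {"是", "因为", "是因为"}

result = list(cut_for_search_sketch(["是因为"], TOY_VOCAB))
print(result)  # ['因为', '是因为']
```

If that reading is right, the full token showing up next to 因为 is expected in search mode, since the extra sub-words are there to improve recall rather than to replace the original token.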

Setting HMM to False gives ['影子', '的', '出现', '是因为', '有', '光']

Unsure if this is a bug or by design. Is the right approach here to use a custom user dictionary limited to the top 20k words or so?
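As a way of reasoning about the dictionary idea, below is a small sketch of dictionary-only segmentation that picks the split with the highest unigram probability, which is roughly what I understand happens with HMM off. Everything in it is an assumption for illustration: `segment`, the counts in `TOY_FREQ`, and `TOTAL` are made up and are not jieba's real dictionary values.

```python
import math


def segment(text, freq, total):
    """Return the split of text with the highest summed log unigram probability."""
    n = len(text)
    # best[i] = (score of the best split of text[i:], end index of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = text[i:j]
            # Single characters are always allowed; unseen ones get a count of 1.
            if word in freq or j == i + 1:
                score = math.log(freq.get(word, 1) / total) + best[j][0]
                candidates.append((score, j))
        best[i] = max(candidates)
    # Walk the chosen end indices from the start to recover the words.
    out, i = [], 0
    while i < n:
        j = best[i][1]
        out.append(text[i:j])
        i = j
    return out


# Hypothetical counts and corpus size, chosen only to reproduce the behaviour.
TOTAL = 60_000_000
TOY_FREQ = {"影子": 5000, "的": 3_000_000, "出现": 40_000, "是": 800_000,
            "因为": 30_000, "是因为": 2000, "有": 500_000, "光": 20_000}

sentence = "影子的出现是因为有光"
with_entry = segment(sentence, TOY_FREQ, TOTAL)
# Drop the merged entry, as a trimmed user dictionary would.
trimmed = {w: c for w, c in TOY_FREQ.items() if w != "是因为"}
without_entry = segment(sentence, trimmed, TOTAL)

print(with_entry)     # ['影子', '的', '出现', '是因为', '有', '光']
print(without_entry)  # ['影子', '的', '出现', '是', '因为', '有', '光']
```

In this toy setup the merged token wins only because it has its own dictionary entry, so removing that entry is enough to get the split. In jieba itself, del_word and suggest_freq look like the documented ways to adjust a single entry without replacing the whole dictionary, though I have not confirmed they give this exact result here.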

Apologies if this is pure user error, I am new to jieba and still trying to figure out all the features. Thanks for any recommendations.

brownbat, Apr 02 '23 05:04