
Jieba integration

Open oabu opened this issue 2 years ago • 28 comments

ICU is not a good choice for Chinese text. In addition, a customizable dictionary is very important for Chinese word segmentation, because word usage differs completely between industries.
Taking Jieba word segmentation as an example, it has a mode called search mode, which is designed specifically for full-text retrieval.

To illustrate this, I made an example; please take a look and you will see the difference.
http://lx.host.dabai.com/
The FULL result is the correct one.

Taking "清华大学" (Tsinghua University) as an example: few people search for the full "清华大学"; most use "清华" as the keyword, so we need both "清华大学" and "清华" as tokens. @sanikolaev @dzcpy
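
To make the request concrete, here is a minimal sketch of the search-mode idea in plain Python. It is not Jieba's or Manticore's implementation: the toy `DICTIONARY`, the helper names `expand_for_search` and `build_index`, and the pre-segmented `docs` are all assumptions made for illustration. The point is only that indexing the dictionary sub-words of a long word lets a query for "清华" find a document that contains "清华大学".

```python
from collections import defaultdict

# Toy dictionary (assumption); a real segmenter ships a large dictionary with frequencies.
DICTIONARY = {"清华", "大学", "清华大学", "北京", "北京大学"}


def expand_for_search(word, dictionary):
    """Return the 2- and 3-character dictionary sub-words of `word`, then `word` itself."""
    tokens = []
    for size in (2, 3):
        if len(word) > size:
            for start in range(len(word) - size + 1):
                piece = word[start:start + size]
                if piece in dictionary:
                    tokens.append(piece)
    tokens.append(word)
    return tokens


def build_index(docs, dictionary):
    """Map each token to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            for token in expand_for_search(word, dictionary):
                index[token].add(doc_id)
    return index


# Documents are given already segmented into words to keep the sketch short.
docs = {1: ["清华大学"], 2: ["北京大学"]}
index = build_index(docs, DICTIONARY)

print(expand_for_search("清华大学", DICTIONARY))
print(sorted(index["清华"]), sorted(index["大学"]))
```

With only the single token "清华大学" in the index, the query "清华" would match nothing; with the expansion, it matches document 1.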

Internal Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

  • [x] Implementation completed
  • [x] Tests developed
  • [x] Documentation updated
  • [x] Documentation reviewed
  • [x] Changelog updated

oabu avatar Nov 08 '22 02:11 oabu

@oabu Thank you for your feedback. So do you recommend adding integration with https://github.com/yanyiwu/cppjieba ?

sanikolaev avatar Nov 15 '22 13:11 sanikolaev

@sanikolaev Yes, I hope Manticore Search can integrate with Jieba. Because it does not support Chinese word segmentation, I have temporarily chosen Meilisearch.

If you decide to integrate with Jieba, here is a nice discussion to refer to

malacca avatar Nov 28 '22 20:11 malacca

@fxtxkktv in https://github.com/manticoresoftware/manticoresearch/issues/1137 expressed his interest in adding Jieba support into Manticore.

sanikolaev avatar May 22 '23 03:05 sanikolaev

there is another repo that is related to Chinese word segmentation. And it was written in C++.

https://github.com/fastcws/fastcws

axhiao avatar May 24 '23 13:05 axhiao

there is another repo that is related to Chinese word segmentation. And it was written in C++.

Jieba seems to be more popular. What are the advantages of this one? Is there any benchmark comparing it with Jieba and/or ICU?

sanikolaev avatar May 24 '23 13:05 sanikolaev

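There is another repository related to Chinese word segmentation. It is written in C++.

Jieba seems to be more popular. What are the advantages of this one? Is there any benchmark comparing it with Jieba and/or ICU?

Jieba: custom Chinese word segmentation is useful.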
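
As a rough illustration of why a custom dictionary matters, the sketch below segments the same text with and without a domain term. It uses simple forward maximum matching, which is not Jieba's actual algorithm, and the names `parse_user_dict` and `segment` are hypothetical. The user-dictionary text follows the one-word-per-line layout (word, optional frequency, optional part-of-speech tag) that Jieba documents for its user dictionaries; the frequency and tag are ignored here.

```python
def parse_user_dict(text):
    """Collect the first field of every non-empty line as a dictionary word."""
    words = set()
    for line in text.splitlines():
        fields = line.split()
        if fields:
            words.add(fields[0])
    return words


def segment(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position."""
    max_len = max(len(word) for word in dictionary)
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


BASE = {"区块", "技术"}                      # toy general-purpose dictionary (assumption)
USER_DICT = "区块链 5 n\n"                   # domain term: "blockchain"
custom = BASE | parse_user_dict(USER_DICT)

print(segment("区块链技术", BASE))
print(segment("区块链技术", custom))
```

Without the domain term, "区块链" is split into "区块" and "链"; with it, the term stays one token, which is the behavior an industry-specific index needs.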

fxtxkktv avatar May 25 '23 02:05 fxtxkktv

@sanikolaev Hi, is there any plan to use Jieba for Chinese text segmentation? The most popular Chinese text segmentation library is https://github.com/fxsjy/jieba and its C++ version is https://github.com/yanyiwu/cppjieba.

jacentsao avatar Jul 18 '23 07:07 jacentsao

This issue won't make it to the upcoming release. Hopefully we'll address this issue in the next release, i.e. in a few months.

sanikolaev avatar Jul 18 '23 08:07 sanikolaev

I think Jieba is currently the best open-source Chinese word segmenter. It supports segmentation of both Simplified and Traditional Chinese, as well as custom dictionaries.

Jieba supports three segmentation modes: precise mode, full mode, and search engine mode. It is very well suited to full-text search; what I used with Elasticsearch was also Jieba. @sanikolaev
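
The three modes can be compared with a toy sketch. This is an assumption-laden illustration, not Jieba's code: Jieba chooses its precise segmentation with a frequency-weighted dictionary and an HMM for unknown words, whereas `cut_precise` below uses plain forward maximum matching over a tiny hand-made `DICTIONARY`. The function names are hypothetical.

```python
# Toy dictionary (assumption); real dictionaries also carry word frequencies.
DICTIONARY = {"我", "来到", "北京", "清华", "华大", "大学", "清华大学"}
MAX_LEN = max(len(word) for word in DICTIONARY)


def cut_precise(text):
    """Precise mode (toy): one non-overlapping segmentation, longest match first."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in DICTIONARY:
                words.append(piece)
                i += size
                break
    return words


def cut_full(text):
    """Full mode (toy): every dictionary word found at any position, overlaps included."""
    return [text[i:j]
            for i in range(len(text))
            for j in range(i + 1, min(len(text), i + MAX_LEN) + 1)
            if text[i:j] in DICTIONARY]


def cut_search(text):
    """Search mode (toy): precise segmentation plus dictionary sub-words of long words."""
    tokens = []
    for word in cut_precise(text):
        for size in (2, 3):
            if len(word) > size:
                tokens += [word[k:k + size]
                           for k in range(len(word) - size + 1)
                           if word[k:k + size] in DICTIONARY]
        tokens.append(word)
    return tokens


text = "我来到北京清华大学"
for cut in (cut_precise, cut_full, cut_search):
    print(cut.__name__, "/".join(cut(text)))
```

Precise mode yields only "清华大学" for the university name, while full and search modes also emit "清华" and "大学", which is why they suit full-text retrieval.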

JonGates avatar Aug 15 '23 13:08 JonGates

@oabu Thank you for your feedback. So do you recommend adding integration with https://github.com/yanyiwu/cppjieba ?

https://github.com/fxsjy/jieba https://github.com/yanyiwu/cppjieba

oabu avatar Aug 16 '23 01:08 oabu

hi @sanikolaev ,

Do you have any plan or timeline regarding the full integration of Jieba?

Thanks.

jaric avatar Jan 10 '24 04:01 jaric

Hi @jaric

Unfortunately, it's not in our nearest plans, but we are still interested in it. Ideally, we'd like someone to make a pull request or sponsor the development :)

sanikolaev avatar Jan 10 '24 07:01 sanikolaev

This is very important for Chinese developers when choosing Manticore. For now, small companies may choose PostgreSQL, and big companies stick to Elasticsearch. I think Meilisearch and Manticore will be the next stars. Many friends of mine at startups recommend Meilisearch for its ease of use and Chinese support. I personally prefer Manticore for being SQL-first, but I am disappointed by the absence of Jieba support. This is not so hard, but it is absolutely important!

thegenius avatar Feb 01 '24 23:02 thegenius

@thegenius thanks for the comment. I've added this task to the roadmap - https://roadmap.manticoresearch.com/

sanikolaev avatar Feb 02 '24 16:02 sanikolaev

Jieba is very important for Chinese; I hope it becomes available soon.

xzxiaoshan avatar Mar 28 '24 04:03 xzxiaoshan

Is there any news on this topic? Lots of startup companies are waiting for this feature.

smellbee avatar May 17 '24 02:05 smellbee

is there any news on this topic?

@smellbee, unfortunately, there is no significant progress on this topic yet, except that we now have a better understanding of how this can be integrated internally. Regretfully, none of those startup companies have been willing to sponsor the development. For more information, you can visit: https://manticoresearch.com/services/

sanikolaev avatar May 17 '24 02:05 sanikolaev

Regretfully, none of those startup companies have been willing to sponsor the development. For more information, you can visit: https://manticoresearch.com/services/

I think those startups have no financial security, or they are too weak for now. Adopting a new technical solution is experimental and risky, so persuading them to change is not easy. Most of them can only follow the old but widely known solutions that the majority uses.

But if there are some key features that are important to their business, that might prompt them to give it a try.
Once they see some benefit, such as reduced hardware cost or easier implementation of business features, I think they might be genuinely willing to give back, through sponsorship or even investment.

In my opinion, if we want to target a market with a large number of potential customers, this feature could be of some importance. Many larger or giant companies focused on the Chinese market need better DB solutions; these are potential big donors or investors. I am 100% sure that if this feature is released, it will draw their attention.

smellbee avatar May 17 '24 03:05 smellbee

I have been following this project for quite some time, but haven't used it because the Chinese word segmentation support was not very user-friendly. I remember that https://github.com/veelion/manticoresearch-seg provides Chinese word segmentation support. I wonder why the official team hasn't incorporated this project. https://github.com/manticoresoftware/manticoresearch/pull/175

lgl5240 avatar May 21 '24 14:05 lgl5240

Hopefully we'll have time to integrate with Jieba in a few weeks. There are two major tasks to finish before it:

  • https://github.com/manticoresoftware/manticoresearch/issues/1928 - in progress
  • https://github.com/manticoresoftware/manticoresearch/issues/1673 (push it from beta to stable)

sanikolaev avatar May 23 '24 15:05 sanikolaev

I am 100% sure if this feature released, will have some hits to draw their attention.

Is there any news on this topic? How is the progress of word segmentation support for Chinese now? @sanikolaev @glookka

Tptogiar avatar Jul 22 '24 03:07 Tptogiar

Unfortunately, the tasks mentioned above, involving JOIN and secondary indexes, took much longer than expected. The Jieba integration is still in our near-term plans and on the roadmap.

sanikolaev avatar Jul 22 '24 14:07 sanikolaev

FYI: we are finally working on this task.

sanikolaev avatar Aug 21 '24 05:08 sanikolaev

Done in 786cc198de08c1abd25e446fdb50a2c527258f91

glookka avatar Sep 25 '24 15:09 glookka

Reopening to do better testing.

sanikolaev avatar Sep 26 '24 04:09 sanikolaev

@PavelShilin89 pls go ahead with testing the new functionality.

sanikolaev avatar Sep 26 '24 10:09 sanikolaev