
Advice on the tokenizer

Open lionsoul2014 opened this issue 6 years ago • 7 comments

It's great that RediSearch could be another choice for full-text search. I am an Elasticsearch fan and have been a search engineer for almost ten years.

I am the author of Friso, and Friso can actually do a lot more, such as:

  1. Chinese tokenization for both Simplified Chinese and Traditional Chinese (simply by converting the Simplified Chinese lexicon to Traditional Chinese).
  2. Japanese tokenization (Japanese lexicons needed).
  3. Korean tokenization (Korean lexicons needed).
  4. Latin character processing, such as case conversion, full-width/half-width conversion, combination word recognition, etc. These features are very important for guaranteeing good search results.
  5. Synonym management, with synonyms appended automatically.
  6. Multiple word segmentation algorithms and modes designed for different usage scenarios (the RediSearch user will know the best choice).

Since the tokenization component is VERY important for a full-text search product, here are some of my suggestions for the RediSearch tokenizer module that may make it more powerful for full-text search:

1. Tokenizer plugin or module support: we have many well-known tokenizer implementations designed for different usage scenarios, some for more accurate word segmentation results, some for better performance, and some built ONLY for search.

So, it would be great if RediSearch implemented a common wrapper to allow more third-party tokenizers to plug in; these could be tokenizers for all kinds of other languages, like Hindi, Russian, etc.
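To make the wrapper idea concrete, here is a minimal sketch in Python of what a pluggable tokenizer registry could look like. Everything here (the `Token` fields, `register_tokenizer`, the `whitespace` fallback) is hypothetical and only illustrates the shape of the interface, not an actual RediSearch API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator

@dataclass
class Token:
    text: str          # the token's surface form
    start: int         # offset where the token begins in the input
    end: int           # offset just past the token's last character
    pos_incr: int = 1  # position increment (0 would mark a synonym of the previous token)

# Registry mapping a string identifier to a tokenizer callable.
TOKENIZERS: Dict[str, Callable[[str], Iterator[Token]]] = {}

def register_tokenizer(name: str, fn: Callable[[str], Iterator[Token]]) -> None:
    """Plug a third-party tokenizer in under a stable name."""
    TOKENIZERS[name] = fn

def whitespace_tokenizer(text: str) -> Iterator[Token]:
    """Trivial built-in fallback: split on whitespace, lowercase each token."""
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        end = start + len(word)
        offset = end
        yield Token(word.lower(), start, end)

register_tokenizer("whitespace", whitespace_tokenizer)

# A hypothetical Friso binding would register itself the same way, e.g.:
# register_tokenizer("friso_complex", friso.complex_tokenize)
```

The point of the registry is that a Friso (or Hindi, or Russian) tokenizer only has to produce the same `Token` stream; the engine never needs to know which implementation is behind the name.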

2. Multiple tokenizer instance support: in a search product we usually use different tokenizers for different fields, or one tokenizer for document indexing and another for document retrieval.

So, it would be nice if we could define tokenizers ourselves, each with a string identifier for a different use, for example:

  - friso_complex: the Friso tokenizer with the complex word segmentation algorithm and related configuration.
  - friso_search: the Friso tokenizer with the search word segmentation algorithm and related configuration.
  - for_index: whatever tokenizer is made for document indexing.
  - for_search: whatever tokenizer is made for document retrieval.
  - ... more tokenizer definitions

Then at indexing time I could choose to use the "friso_search" tokenizer, and "friso_complex" for search ONLY (just like Elasticsearch does).
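A minimal sketch of how named tokenizer instances could be resolved per field and per phase (indexing vs. search). All the option names here (`index_tokenizer`, `search_tokenizer`, the per-field `tokenizer` override) are made up for illustration and are not real RediSearch settings:

```python
from typing import Optional

# Hypothetical per-index configuration: phase defaults plus per-field overrides.
index_config = {
    "index_tokenizer": "friso_search",    # default when documents are ingested
    "search_tokenizer": "friso_complex",  # default when queries are parsed
    "fields": {
        "title": {"tokenizer": "friso_complex"},  # per-field override
        "body":  {"tokenizer": "friso_search"},
    },
}

def tokenizer_for(config: dict, phase: str, field: Optional[str] = None) -> str:
    """Resolve which tokenizer applies: a field override wins, else the phase default."""
    if field is not None:
        override = config["fields"].get(field, {}).get("tokenizer")
        if override:
            return override
    return config[f"{phase}_tokenizer"]
```

With a scheme like this, `tokenizer_for(index_config, "index", "title")` resolves to the "friso_complex" override, while a query against an unconfigured field falls back to the phase default.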

3. Tokenizer-level synonym support: we may have to manage millions of lexicon or dictionary entries for a large search product, including synonyms, stopwords, etc.

So, it would be perfect to support tokenizer-level synonyms, so a corpus management team could manage their lexicons or dictionaries in a more efficient way.

In Elasticsearch (actually Lucene does the work), tokens produced by the tokenizer with a ZERO position increment are considered synonyms. With this feature we could:

  1. Do cross-language search, e.g. search "google" to match Chinese documents containing "谷歌", Japanese documents containing "グーグル", etc.

  2. Do synonym-mapping search, e.g. search "wechat" to match documents containing "微信", "weixin", "v信", or "w信", or search "love" to match documents containing "like" or "enjoy", etc.

or even search "impl" to match documents containing "implement" or "implements", if the user wants RediSearch to.
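The zero-position-increment convention described above can be sketched in a few lines of Python. The token stream below is illustrative; the point is that a token emitted with increment 0 lands on the same position as the preceding token, so the index treats the two as interchangeable at that slot:

```python
# Each tuple is (token_text, position_increment), mimicking the
# Lucene-style output of a synonym-aware tokenizer.
stream = [
    ("wechat", 1),  # original token: advances the position
    ("微信", 0),     # synonym: same position as "wechat"
    ("weixin", 0),  # another synonym, still the same position
    ("is", 1),
    ("great", 1),
]

def positions(token_stream):
    """Assign absolute positions from position increments."""
    pos, out = -1, {}
    for text, incr in token_stream:
        pos += incr
        out.setdefault(pos, []).append(text)
    return out

index = positions(stream)
# index[0] now holds ["wechat", "微信", "weixin"], so a query for any
# of the three matches this document at position 0.
```

Phrase queries also keep working under this scheme, because the synonyms share one position: "wechat is great" and "weixin is great" both align against positions 0, 1, 2.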

If these suggestions are something RediSearch might consider, I would be glad to spend some time submitting PRs, since I think RediSearch could be an Elasticsearch alternative that is faster and more lightweight.

lionsoul2014 avatar Feb 17 '20 05:02 lionsoul2014

Hello @lionsoul2014

We have plans to expand RediSearch tokenization options but no clear roadmap at the moment. Your offer clearly shows deep understanding of the topic and its use-cases.

We will be happy to get a PR that would fit within our overall design and code conventions. Therefore, we would like to start a design discussion that can take place on this thread. Can this work for you?

Kind regards,

Ariel

P.S. RediSearch is licensed under the Redis Source Available License. Its main restriction applies to offering database products, but it is free to use for most other products. Please look at the license and make sure it will not prevent you from using it in your offerings. Let me know if you need additional information on the topic.

ashtul avatar Feb 20 '20 13:02 ashtul

@ashtul I've seen the license already; it is OK for me since I will not build any database products. I will ONLY use RediSearch as a database. And I am sick and tired of the many cloud computing service providers that just enjoy the commercial convenience of open-source software but never contribute anything back.

I have not started to study the source code yet, but a discussion of the roadmap, functions, interfaces, or implementation design would be fine with me.

lionsoul2014 avatar Feb 21 '20 02:02 lionsoul2014

Are there any updates on this issue? Thanks.

ZengyiZhao avatar Jul 12 '23 09:07 ZengyiZhao

+1 need Japanese and Korean. must have!

nikolaydubina avatar Dec 05 '24 07:12 nikolaydubina

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Apr 08 '25 02:04 github-actions[bot]

yeah, I switched to Elasticsearch in the end

nikolaydubina avatar Apr 08 '25 04:04 nikolaydubina

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] avatar Jun 08 '25 02:06 github-actions[bot]