
Support tokenizers for CJK languages

Open · nick008a opened this issue on May 17, 2023 · 3 comments

Is your feature request related to a problem? Please describe. A tokenizer's principal role is to split documents into words (tokens) so that each document can be indexed by the words it contains. It is also needed to split a search query into tokens and run the search against the word index.
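To illustrate the problem, here is a minimal Rust sketch (not Qdrant code) of naive whitespace tokenization: it works for space-delimited languages, but CJK text has no spaces between words, so it comes back as a single unsegmentable token and no word-level index can be built.

```rust
// A minimal sketch: whitespace tokenization works for English but not for CJK.

fn whitespace_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    // Space-delimited languages split cleanly into words:
    assert_eq!(
        whitespace_tokenize("vector search engine"),
        vec!["vector", "search", "engine"]
    );

    // Chinese text ("vector search engine", written without spaces) comes
    // back as one giant token, so no word-level index can be built from it.
    let tokens = whitespace_tokenize("向量搜索引擎");
    assert_eq!(tokens, vec!["向量搜索引擎"]);
}
```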

Describe the solution you'd like https://github.com/meilisearch/meilisearch/issues/624

Describe alternatives you've considered Use a BERT embedding with vector search. (This may not work well for new phrases.)
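As a sketch of that alternative: embed whole strings with a multilingual model and compare vectors, so no word segmentation is needed at all. The `embed` function below is a hypothetical placeholder for a real BERT-style model; only the cosine similarity is implemented.

```rust
// Sketch of the embedding workaround: compare whole-string vectors instead of
// tokenizing. `embed` is a hypothetical stand-in for a multilingual model.

fn embed(_text: &str) -> Vec<f32> {
    // Placeholder: a real implementation would return the model's sentence vector.
    vec![0.0; 384]
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm(a) * norm(b) + f32::EPSILON)
}

fn main() {
    let query = embed("向量搜索");           // "vector search"
    let doc = embed("这是一个向量搜索引擎"); // "this is a vector search engine"
    // Semantic matching needs no segmentation, but as noted above, embeddings
    // may generalize poorly to new phrases the model has never seen.
    println!("similarity: {}", cosine_similarity(&query, &doc));
}
```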

nick008a avatar May 17 '23 04:05 nick008a

/bounty $250

generall avatar May 24 '23 10:05 generall

~~💎 $250 bounty created by generall~~
~~🙋 If you start working on this, comment /attempt #1909 to notify everyone~~
~~👉 To claim this bounty, submit a pull request that includes the text /claim #1909 somewhere in its body~~
~~📝 Before proceeding, please make sure you can receive payouts in your country~~
~~💵 Payment arrives in your account 2-5 days after the bounty is rewarded~~
~~💯 You keep 100% of the bounty award~~
~~🙏 Thank you for contributing to qdrant/qdrant!~~

| Attempt | Started | Solution |
| --- | --- | --- |
| 🟢 @zarkone | Jun 4, 2023 | #2023 |

algora-pbc[bot] avatar May 24 '23 10:05 algora-pbc[bot]

Should the new CJK tokenizers use the same crates used in the referenced issue?

ibrahim-akrab avatar May 25 '23 09:05 ibrahim-akrab

/attempt #1909

zarkone avatar Jun 04 '23 20:06 zarkone

After an initial implementation in https://github.com/qdrant/qdrant/pull/2023 we discovered that this has no place in Qdrant at this time. Supporting it by adding CJK dictionaries doubles the binary size, which is too significant for the benefit it brings. Other approaches to achieve the same are not any better.

We may revisit this idea in the future.

timvisee avatar Jul 13 '23 12:07 timvisee

> After an initial implementation in https://github.com/qdrant/qdrant/pull/2023 we discovered that this has no place in Qdrant at this time. Supporting it by adding CJK dictionaries doubles the binary size, which is too significant for the benefit it brings. Other approaches to achieve the same are not any better.

I have watched this issue for a while, and I agree with @timvisee. I wanted to add some context here. To support Chinese, Japanese, and Korean tokenizers, we essentially need large language-specific dictionaries in addition to the tokenizers themselves. Supporting CJK is worth the effort for keyword-based search engines (Lucene, Tantivy, or Meilisearch), but in Qdrant the benefit may not be worth the cost.
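For concreteness, here is a minimal sketch of dictionary-based Chinese segmentation with the jieba-rs crate (one of the Rust crates in this space; the crate version below is an assumption). `Jieba::new()` loads a bundled multi-megabyte dictionary into the binary, which is exactly the size cost discussed above.

```rust
// Dictionary-based Chinese word segmentation with jieba-rs.
// Cargo.toml: jieba-rs = "0.6"  (version is an assumption)
use jieba_rs::Jieba;

fn main() {
    // Loads the bundled default dictionary; this is where the binary-size cost
    // comes from, and Japanese/Korean would need dictionaries of their own.
    let jieba = Jieba::new();

    // "I came to Tsinghua University in Beijing", written without spaces.
    let words = jieba.cut("我来到北京清华大学", false);
    println!("{:?}", words); // ["我", "来到", "北京", "清华大学"]
}
```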

I'm just curious: would it be possible to support plug-and-play-style tokenizers, so that users can add third-party or their own tokenizers as needed? I will dig into this area if I get a chance. A sketch of what I mean follows.
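As a purely illustrative sketch of that idea (none of these types exist in Qdrant), tokenizers could sit behind a trait and a registry keyed by name, so a CJK tokenizer and its dictionary could ship as an optional plugin instead of in the core binary:

```rust
// Hypothetical plug-and-play tokenizer design; illustrative only.
use std::collections::HashMap;

trait Tokenizer: Send + Sync {
    /// Split `text` into tokens for indexing and search.
    fn tokenize(&self, text: &str) -> Vec<String>;
}

/// Built-in default: whitespace splitting.
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_owned).collect()
    }
}

/// Registry keyed by name, so collection config could select a tokenizer,
/// and a CJK plugin could call `register` without touching the core binary.
struct TokenizerRegistry {
    tokenizers: HashMap<String, Box<dyn Tokenizer>>,
}

impl TokenizerRegistry {
    fn new() -> Self {
        let mut tokenizers: HashMap<String, Box<dyn Tokenizer>> = HashMap::new();
        tokenizers.insert("whitespace".into(), Box::new(WhitespaceTokenizer));
        Self { tokenizers }
    }

    fn register(&mut self, name: &str, tokenizer: Box<dyn Tokenizer>) {
        self.tokenizers.insert(name.to_owned(), tokenizer);
    }

    fn get(&self, name: &str) -> Option<&dyn Tokenizer> {
        self.tokenizers.get(name).map(|b| &**b)
    }
}

fn main() {
    let mut registry = TokenizerRegistry::new();
    // A hypothetical CJK plugin would add itself here, e.g. one wrapping jieba-rs:
    registry.register("whitespace-alias", Box::new(WhitespaceTokenizer));

    let tokens = registry.get("whitespace").unwrap().tokenize("hello world");
    assert_eq!(tokens, vec!["hello", "world"]);
}
```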

mocobeta avatar Jul 17 '23 03:07 mocobeta

> After an initial implementation in #2023 we discovered that this has no place in Qdrant at this time. Supporting it by adding CJK dictionaries doubles the binary size, which is too significant for the benefit it brings. Other approaches to achieve the same are not any better.
>
> We may revisit this idea in the future.

Also, any updates on the Docker version or tags?

yangboz avatar Apr 15 '24 05:04 yangboz