
Support tokenizers for CJK languages

Open · nick008a opened this issue on May 17, 2023 · 3 comments

Is your feature request related to a problem? Please describe. A tokenizer's principal role is to split documents into words (tokens) so that each document can be indexed by the words it contains. It is also needed to split a search query into tokens and run the search against the word index.
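To illustrate the problem, here is a minimal Rust sketch (not Qdrant code) of naive whitespace tokenization: it works for space-delimited languages, but CJK text has no spaces between words, so it comes back as a single unsegmentable token and no word-level index can be built.

```rust
// A minimal sketch: whitespace tokenization works for English but not for CJK.

fn whitespace_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    // Space-delimited languages split cleanly into words:
    assert_eq!(
        whitespace_tokenize("vector search engine"),
        vec!["vector", "search", "engine"]
    );

    // Chinese text ("vector search engine", written without spaces) comes
    // back as one giant token, so no word-level index can be built from it.
    let tokens = whitespace_tokenize("向量搜索引擎");
    assert_eq!(tokens, vec!["向量搜索引擎"]);
}
```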

Describe the solution you'd like https://github.com/meilisearch/meilisearch/issues/624

Describe alternatives you've considered Use a BERT embedding with vector search. (This may not work well for new phrases.)
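As a sketch of that alternative: embed whole strings with a multilingual model and compare vectors, so no word segmentation is needed at all. The `embed` function below is a hypothetical placeholder for a real BERT-style model; only the cosine similarity is implemented.

```rust
// Sketch of the embedding workaround: compare whole-string vectors instead of
// tokenizing. `embed` is a hypothetical stand-in for a multilingual model.

fn embed(_text: &str) -> Vec<f32> {
    // Placeholder: a real implementation would return the model's sentence vector.
    vec![0.0; 384]
}

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm(a) * norm(b) + f32::EPSILON)
}

fn main() {
    let query = embed("向量搜索");           // "vector search"
    let doc = embed("这是一个向量搜索引擎"); // "this is a vector search engine"
    // Semantic matching needs no segmentation, but as noted above, embeddings
    // may generalize poorly to new phrases the model has never seen.
    println!("similarity: {}", cosine_similarity(&query, &doc));
}
```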

nick008a avatar May 17 '23 04:05 nick008a

/bounty $250

generall avatar May 24 '23 10:05 generall

~~💎 $250 bounty created by generall~~
~~🙋 If you start working on this, comment /attempt #1909 to notify everyone~~
~~👉 To claim this bounty, submit a pull request that includes the text /claim #1909 somewhere in its body~~
~~📝 Before proceeding, please make sure you can receive payouts in your country~~
~~💵 Payment arrives in your account 2-5 days after the bounty is rewarded~~
~~💯 You keep 100% of the bounty award~~
~~🙏 Thank you for contributing to qdrant/qdrant!~~

| Attempt | Started | Solution |
| --- | --- | --- |
| 🟢 @zarkone | Jun 4, 2023 | #2023 |

algora-pbc[bot] avatar May 24 '23 10:05 algora-pbc[bot]

Should the new CJK tokenizers use the same crates used in the referenced issue?

ibrahim-akrab avatar May 25 '23 09:05 ibrahim-akrab

/attempt #1909

zarkone avatar Jun 04 '23 20:06 zarkone

After an initial implementation in https://github.com/qdrant/qdrant/pull/2023 we discovered that this has no place in Qdrant at this time. Supporting it by adding CJK dictionaries doubles the binary size, which is too significant for the benefit it brings. Other approaches to achieve the same are not any better.

We may revisit this idea in the future.

timvisee avatar Jul 13 '23 12:07 timvisee

> After an initial implementation in https://github.com/qdrant/qdrant/pull/2023 we discovered that this has no place in Qdrant at this time. Supporting it by adding CJK dictionaries doubles the binary size, which is too significant for the benefit it brings. Other approaches to achieve the same are not any better.

I have watched this issue for a while, and I agree with @timvisee. I wanted to add some context here. To support Chinese, Japanese, and Korean tokenizers, we essentially need large language-specific dictionaries in addition to the tokenizers themselves. Supporting CJK is worth the effort for keyword-based search engines (Lucene, Tantivy, or Meilisearch), but in Qdrant the benefit may not be worth the cost.
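For concreteness, here is a minimal sketch of dictionary-based Chinese segmentation with the jieba-rs crate (one of the Rust crates in this space; the crate version below is an assumption). `Jieba::new()` loads a bundled multi-megabyte dictionary into the binary, which is exactly the size cost discussed above.

```rust
// Dictionary-based Chinese word segmentation with jieba-rs.
// Cargo.toml: jieba-rs = "0.6"  (version is an assumption)
use jieba_rs::Jieba;

fn main() {
    // Loads the bundled default dictionary; this is where the binary-size cost
    // comes from, and Japanese/Korean would need dictionaries of their own.
    let jieba = Jieba::new();

    // "I came to Tsinghua University in Beijing", written without spaces.
    let words = jieba.cut("我来到北京清华大学", false);
    println!("{:?}", words); // ["我", "来到", "北京", "清华大学"]
}
```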

I'm just curious: would it be possible to support plug-and-play-style tokenizers, so that users can add third-party or their own tokenizers as needed? I will dig into this area if I get a chance. A sketch of what I mean follows.
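As a purely illustrative sketch of that idea (none of these types exist in Qdrant), tokenizers could sit behind a trait and a registry keyed by name, so a CJK tokenizer and its dictionary could ship as an optional plugin instead of in the core binary:

```rust
// Hypothetical plug-and-play tokenizer design; illustrative only.
use std::collections::HashMap;

trait Tokenizer: Send + Sync {
    /// Split `text` into tokens for indexing and search.
    fn tokenize(&self, text: &str) -> Vec<String>;
}

/// Built-in default: whitespace splitting.
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_owned).collect()
    }
}

/// Registry keyed by name, so collection config could select a tokenizer,
/// and a CJK plugin could call `register` without touching the core binary.
struct TokenizerRegistry {
    tokenizers: HashMap<String, Box<dyn Tokenizer>>,
}

impl TokenizerRegistry {
    fn new() -> Self {
        let mut tokenizers: HashMap<String, Box<dyn Tokenizer>> = HashMap::new();
        tokenizers.insert("whitespace".into(), Box::new(WhitespaceTokenizer));
        Self { tokenizers }
    }

    fn register(&mut self, name: &str, tokenizer: Box<dyn Tokenizer>) {
        self.tokenizers.insert(name.to_owned(), tokenizer);
    }

    fn get(&self, name: &str) -> Option<&dyn Tokenizer> {
        self.tokenizers.get(name).map(|b| &**b)
    }
}

fn main() {
    let mut registry = TokenizerRegistry::new();
    // A hypothetical CJK plugin would add itself here, e.g. one wrapping jieba-rs:
    registry.register("whitespace-alias", Box::new(WhitespaceTokenizer));

    let tokens = registry.get("whitespace").unwrap().tokenize("hello world");
    assert_eq!(tokens, vec!["hello", "world"]);
}
```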

mocobeta avatar Jul 17 '23 03:07 mocobeta

> After an initial implementation in #2023 we discovered that this has no place in Qdrant at this time. Supporting it by adding CJK dictionaries doubles the binary size, which is too significant for the benefit it brings. Other approaches to achieve the same are not any better.
>
> We may revisit this idea in the future.

Also, any updates on the Docker version or tags?

yangboz avatar Apr 15 '24 05:04 yangboz