Support tokenizers for CJK languages
Is your feature request related to a problem? Please describe. A tokenizer's principal role is to split documents into words (tokens) so that each document can be indexed by the words it contains. It is also needed to split a search query into tokens and to search against the word index.
Describe the solution you'd like https://github.com/meilisearch/meilisearch/issues/624
Describe alternatives you've considered Use a BERT embedding with vector search. (This may not work well for new phrases.)
Additional context Add any other context or screenshots about the feature request here.
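To make the problem concrete, here is a minimal sketch (not Qdrant code) of a naive whitespace tokenizer; the strings are illustrative only, but they show why word-boundary splitting breaks down for CJK text:

```rust
// Naive whitespace tokenizer: works for space-delimited languages,
// fails for CJK text, which has no spaces between words.
fn whitespace_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    // English: each word becomes its own token.
    let en = whitespace_tokenize("full text search in qdrant");
    assert_eq!(en.len(), 5);

    // Japanese ("I went to Tokyo Tower"): the whole sentence comes back
    // as a single token, so a query for 東京 (Tokyo) would never match
    // anything in the word index.
    let ja = whitespace_tokenize("東京タワーに行きました");
    assert_eq!(ja.len(), 1);
}
```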
/bounty $250
~~💎 $250 bounty created by generall~~
~~🙋 If you start working on this, comment /attempt #1909 to notify everyone~~
~~👉 To claim this bounty, submit a pull request that includes the text /claim #1909 somewhere in its body~~
~~📝 Before proceeding, please make sure you can receive payouts in your country~~
~~💵 Payment arrives in your account 2-5 days after the bounty is rewarded~~
~~💯 You keep 100% of the bounty award~~
~~🙏 Thank you for contributing to qdrant/qdrant!~~
| Attempt | Started | Solution |
| --- | --- | --- |
| 🟢 @zarkone | Jun 4, 2023 | #2023 |
Should the new CJK tokenizers use the same crates used in the referenced issue?
/attempt #1909
After an initial implementation in https://github.com/qdrant/qdrant/pull/2023 we discovered that this has no place in Qdrant at this time. Supporting this by adding CJK dictionaries doubles the binary size. That is too significant a cost for the benefit it gives. Other approaches to achieve the same are not any better.
We may revisit this idea in the future.
I have watched this issue for a while, and I would agree with @timvisee. I wanted to add some context here. To support Chinese, Japanese, and Korean tokenizers, we essentially need large language-specific dictionaries in addition to the tokenizer itself. It is worth supporting CJK in keyword-based search engines (Lucene, Tantivy, or Meilisearch), but in Qdrant the benefit may not be worth the cost.
I'm just curious: is there a possibility to support somewhat plug-and-play tokenizers for such needs, so that users can add third-party or their own tokenizers as they need? I will dig into this area if I have a chance.
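For what it's worth, a plug-and-play approach could look something like the sketch below. This is purely hypothetical (Qdrant exposes no such trait today); jieba-rs is used here only as an example of a dictionary-backed Chinese tokenizer, and it also shows where the binary-size cost comes from, since the dictionary ships inside the crate:

```rust
// Hypothetical plug-in tokenizer interface, sketched for discussion only.
trait Tokenizer: Send + Sync {
    fn tokenize(&self, text: &str) -> Vec<String>;
}

// Built-in style tokenizer: fine for space-delimited languages.
struct WhitespaceTokenizer;

impl Tokenizer for WhitespaceTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        text.split_whitespace().map(str::to_owned).collect()
    }
}

// Example user-supplied tokenizer backed by the jieba-rs crate (Chinese).
// Jieba bundles its dictionary into the binary, which is exactly the
// size cost discussed above.
struct JiebaTokenizer {
    inner: jieba_rs::Jieba,
}

impl Tokenizer for JiebaTokenizer {
    fn tokenize(&self, text: &str) -> Vec<String> {
        self.inner
            .cut(text, false) // false = no HMM for out-of-dictionary words
            .into_iter()
            .map(str::to_owned)
            .collect()
    }
}
```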
And any updates on the Docker version or tags?