[Feature Request]: Use the tokenizer configured in ES instead of tokenizing in ragflow itself
Self Checks
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (Language Policy).
- [x] Non-English title submissions will be closed directly (Language Policy).
- [x] Please do not modify this template :) and fill in all the required fields.
Is your feature request related to a problem?
I searched for "磁盘raid10如何删" ("how to delete disk RAID 10").
RAGFlow splits the question into `磁盘`, `raid10`, and `删`, then uses matchText to build a search like this:
```json
"must": [
  {
    "query_string": {
      "fields": [
        "title_tks^10",
        "title_sm_tks^5",
        "important_kwd^30",
        "important_tks^20",
        "question_tks^20",
        "content_ltks^2",
        "content_sm_ltks"
      ],
      "type": "best_fields",
      "query": "((磁盘)^0.4945343603291895 (raid10)^0.26263523162397767 (删)^0.24283040804683279 (\"磁盘 raid10 删\"~2)^1.5)",
      "minimum_should_match": "30%",
      "boost": 1
    }
  }
]
```
However, the chunk content is `删除raid` ("delete RAID"), which is tokenized as `删除` and `raid`, so the chunk I need is NOT returned.
I think allowing users to configure and use a tokenizer in ES would be more flexible and could be more accurate.
It would also be easier to use: just configure the mapping in ES, and RAGFlow would not need to tokenize during bulk indexing and search.
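The recall failure can be sketched as a simple token-overlap check (the token sets below are taken from the example above; the exact-match logic is a simplification for illustration, not RAGFlow's actual scoring):

```python
# Tokens as produced for the query "磁盘raid10如何删" (from the issue).
query_tokens = {"磁盘", "raid10", "删"}

# Tokens as produced for the indexed chunk "删除raid" ("delete raid").
chunk_tokens = {"删除", "raid"}

# A term-based index matches only on exact token equality, so when the
# query side and the index side tokenize differently, the overlap is
# empty and the chunk can never score.
overlap = query_tokens & chunk_tokens
print(overlap)  # -> set(), i.e. no matching terms, chunk not returned
```

Using the same analyzer at index time and at search time (as ES does when the analyzer is configured in the mapping) avoids exactly this class of mismatch.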
Describe the feature you'd like
Use the tokenizer configured in ES for fields such as title, question, and content, instead of tokenizing in RAGFlow itself.
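For illustration, a user-supplied mapping might look roughly like the sketch below (the field name and the IK analyzers are only examples, not RAGFlow's actual schema; `ik_max_word` and `ik_smart` assume the IK analysis plugin is installed in the cluster, and any other installed analyzer could be configured the same way):

```json
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```

With the analyzer declared in the mapping, ES applies it consistently at both index time and query time, so the `删除raid` / `磁盘raid10如何删` mismatch above would not occur.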
Describe implementation you've considered
No response
Documentation, adoption, use case
Additional information
No response
Thanks for your suggestion! We will evaluate the necessity of supporting custom Elasticsearch tokenizers based on real-world needs and usage scenarios. In the meantime, you can improve recall by adding tags or other metadata directly to the chunks. Let us know if you run into any issues or have more ideas, we really appreciate the feedback! @childe
Hello @childe ! I’m the product manager. I’m currently exploring how we can provide more capabilities for full-text search, and I came across your feature request. I’d love to understand why you need this feature. Does your company use its own custom tokenization strategy?