
[Feature Request]: Use the tokenizer configured in ES instead of tokenizing in ragflow itself

Open childe opened this issue 8 months ago • 2 comments

Self Checks

  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (Language Policy).
  • [x] Non-English title submissions will be closed directly (Language Policy).
  • [x] Please do not modify this template :) and fill in all the required fields.

Is your feature request related to a problem?

I searched for "磁盘raid10如何删" ("how to delete disk RAID 10").

ragflow splits the question into `磁盘`, `raid10`, and `删`, then uses matchText to run the search shown below:


        "must": [
          {
            "query_string": {
              "fields": [
                "title_tks^10",
                "title_sm_tks^5",
                "important_kwd^30",
                "important_tks^20",
                "question_tks^20",
                "content_ltks^2",
                "content_sm_ltks"
              ],
              "type": "best_fields",
              "query": "((磁盘)^0.4945343603291895 (raid10)^0.26263523162397767 (删)^0.24283040804683279 (\"磁盘 raid10 删\"~2)^1.5)",
              "minimum_should_match": "30%",
              "boost": 1
            }
          }
        ]


However, the chunk content is `删除raid`, which is tokenized as `删除` and `raid`. None of the query tokens match these terms, so the chunk I need is NOT returned.
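The mismatch can be reproduced with a minimal sketch. It assumes a simplified model in which a query term matches only when it is identical to an indexed term; the token lists are copied from this report, and the helper name `matching_terms` is hypothetical, not part of ragflow or Elasticsearch.

```python
# Tokens as reported in this issue. The tokenizers themselves are not
# reproduced here; only their reported output is compared.
query_tokens = ["磁盘", "raid10", "删"]  # terms in the generated query_string
chunk_tokens = ["删除", "raid"]          # terms indexed for the chunk "删除raid"


def matching_terms(query, indexed):
    """Return the query terms that appear verbatim among the indexed terms.

    Simplified assumption: a term matches only on exact equality, so "删"
    does not match "删除" and "raid10" does not match "raid".
    """
    indexed_set = set(indexed)
    return [term for term in query if term in indexed_set]


matched = matching_terms(query_tokens, chunk_tokens)
print(matched)  # []
```

With no matching terms, neither the individual term clauses nor the phrase clause can match, so under this model the chunk cannot satisfy the query.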

So I think allowing users to configure and use a tokenizer in ES would be more flexible and could be more accurate.

It would also be easier to use: the tokenizer is configured once in the ES mapping, and ragflow no longer needs to tokenize text during bulk indexing or at search time.

Describe the feature you'd like

Use the tokenizer configured in ES for fields such as title, question, and content, instead of tokenizing in ragflow itself.
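A rough sketch of what this could look like, written as plain Python dictionaries that mirror Elasticsearch request bodies. Everything here is an assumption for illustration: the raw-text field names `title`, `question`, and `content`, the helper names, and the choice of the `ik_max_word` analyzer, which is only available when the IK analysis plugin is installed on the cluster. It is not how ragflow currently builds its index.

```python
import json

# Assumption: an analyzer that exists on the cluster. "ik_max_word" comes
# from the IK analysis plugin; any analyzer configured in ES could be used.
ANALYZER = "ik_max_word"

# Hypothetical raw-text fields, replacing pre-tokenized fields such as title_tks.
FIELDS = ["title", "question", "content"]


def build_mapping(fields, analyzer):
    """Build an index body whose text fields are analyzed by ES itself."""
    properties = {name: {"type": "text", "analyzer": analyzer} for name in fields}
    return {"mappings": {"properties": properties}}


def build_query(text, fields):
    """Build a search body that sends the raw question to ES.

    ES analyzes the query text with each field's analyzer, so index-time
    and search-time tokenization stay consistent.
    """
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "type": "best_fields",
                "minimum_should_match": "30%",
            }
        }
    }


mapping = build_mapping(FIELDS, ANALYZER)
query = build_query("磁盘raid10如何删", FIELDS)
print(json.dumps(mapping, ensure_ascii=False, indent=2))
print(json.dumps(query, ensure_ascii=False, indent=2))
```

Whether the query from this report would then match the chunk `删除raid` depends on the terms the chosen analyzer actually emits, which can be inspected with the Elasticsearch `_analyze` API.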

Describe implementation you've considered

No response

Documentation, adoption, use case


Additional information

No response

childe avatar Apr 29 '25 10:04 childe

Thanks for your suggestion! We’ll evaluate whether to support custom Elasticsearch tokenizers based on real-world needs and usage scenarios. In the meantime, you can improve recall by adding tags or other metadata directly to the chunks. Let us know if you run into any issues or have more ideas; we really appreciate the feedback! @childe

which-W avatar May 09 '25 07:05 which-W

Hello @childe ! I’m the product manager. I’m currently exploring how we can provide more capabilities for full-text search, and I came across your feature request. I’d love to understand why you need this feature. Does your company use its own custom tokenization strategy?

ZhenhangTung avatar Dec 08 '25 06:12 ZhenhangTung