Langchain-Chatchat: Add new splitter to process QA type file(now only support JSON) and add Toggle button in knowledge_base page

I wrote a new splitter to improve the processing of QA-type knowledge (currently it supports only JSON, as shown in the example). I also added a Toggle button on the knowledge_base page to switch between the QA splitter and the normal splitter (ChineseRecursiveTextSplitter, defined in kb_config.py).
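
As an illustration only (this is not the code from the PR), the idea of a QA splitter can be sketched as a function that emits one chunk per question/answer pair instead of cutting the text by length. The JSON schema with `question` and `answer` keys, the function name `split_qa_json`, and the chunk layout are all assumptions made for this sketch.

```python
import json
from typing import Dict, List


def split_qa_json(raw: str) -> List[Dict[str, str]]:
    """Turn a JSON array of QA records into one chunk per question/answer pair.

    Assumed (hypothetical) input schema:
        [{"question": "...", "answer": "..."}, ...]
    """
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of QA records")

    chunks = []
    for record in records:
        if not isinstance(record, dict):
            continue  # ignore entries that are not JSON objects
        question = str(record.get("question", "")).strip()
        answer = str(record.get("answer", "")).strip()
        if not question:
            continue  # skip records without a question
        chunks.append({
            # Keep the pair together so it is never split across chunks.
            "page_content": f"question: {question}\nanswer: {answer}",
            "question": question,
            "answer": answer,
        })
    return chunks


sample = json.dumps([
    {"question": "What is a splitter?", "answer": "It cuts documents into chunks."},
    {"question": "Which formats are supported?", "answer": "JSON only, for now."},
])
chunks = split_qa_json(sample)
```

Keeping each pair in a single chunk means a retrieved question always arrives with its own answer, which a length-based splitter cannot guarantee.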

I created a PR because I noticed that many people are encountering the same issue (#3164, #893, and others).

Here are the updated page and test results for the QA splitter:

Mar 13 '24 04:03 Donovan-Ye

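Hello, I made the changes following this code, but splitting still ends up going through ChineseRecursiveTextSplitter. Your screenshot seems to show the same thing.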

Apr 18 '24 08:04 chuanSir123

Solved. A network problem caused a different splitter to be selected by default.

Apr 18 '24 08:04 chuanSir123

Yes, that's right. If huggingface can't be reached, the default splitter is used.

Apr 19 '24 01:04 Donovan-Ye

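Hello, I'd like to ask: since I have defined my own qa_text_splitter.py, why does it still need to go online to huggingface? I don't understand this part very well.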

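You can follow the logic that vectorizes an uploaded file; along the way it reaches this point: https://github.com/Donovan-Ye/Langchain-Chatchat/blob/2ef5d1fafe164797151ad79c8c42f04e39cc4876/server/knowledge_base/utils.py#L189

You will see that it loads the splitter and the tokenizer based on source and tokenizer_name_or_path. Because I set the source of qa_text_splitter to huggingface, it follows this branch and loads the corresponding tokenizer. If loading fails, it falls through to the except block below and uses the default splitter.

I haven't looked into this very deeply either. For my own use cases: 1. With a local model, set tokenizer_name_or_path to ''. 2. With the OpenAI API, set tokenizer_name_or_path to gpt2.

That said, I just took a closer look, and you could try setting source to ''. I noticed there is also an else branch.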

try:
    # ...
    if text_splitter_dict[splitter_name]["source"] == "tiktoken":  # load from tiktoken
        # ...
    elif text_splitter_dict[splitter_name]["source"] == "huggingface":  # load from huggingface
        # ...
    else:
        try:
            text_splitter = TextSplitter(
                pipeline="zh_core_web_sm",
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap
            )
        except:
            text_splitter = TextSplitter(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap
            )
except Exception as e:
    # Any failure above (for example, a tokenizer that cannot be loaded) lands here
    # and the default splitter is used instead.
    print(e)
    text_splitter_module = importlib.import_module('langchain.text_splitter')
    TextSplitter = getattr(text_splitter_module, "RecursiveCharacterTextSplitter")
    text_splitter = TextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
# ...

Apr 23 '24 08:04 Donovan-Ye

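Hello, I'd like to ask: I used qa_text_splitter.py when initializing the database, but I only want to vectorize the question part, not the answer. How can I do that? Currently, with qa_text_splitter.py, the whole Q-A pair gets vectorized.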

Apr 23 '24 08:04 nauyiahc

Do you mean you only want this part?

Apr 23 '24 09:04 Donovan-Ye

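I put together a simple implementation: in the embed_documents method of EmbeddingsFunAdapter in base.py, I use a regular expression to extract the question from texts at vectorization time, so that only the question is vectorized.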

Apr 23 '24 10:04 nauyiahc

To vectorize only the question, I added the following function to the embed_documents method:

Apr 23 '24 10:04 nauyiahc

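I think converting texts directly into a dict and taking the value of question would also work, using try to fetch it. I want to do this during database initialization and incremental updates, and I am not considering the front-end page for now. With only the question vectorized, the retrieval threshold can be set lower and matching becomes more precise.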
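
The dict-based idea above might look roughly like the sketch below. It is an assumption-laden illustration, not the code that was actually added to embed_documents: the helper name `questions_for_embedding` is hypothetical, and it assumes each QA chunk is a JSON object with a `question` key. Chunks that do not parse are returned unchanged, so ordinary documents would still be embedded as before.

```python
import json
from typing import List


def questions_for_embedding(texts: List[str]) -> List[str]:
    """Return the text to embed for each chunk: the question if one can be parsed.

    Assumes (hypothetically) that a QA chunk is a JSON object with a
    "question" key; any chunk that does not parse is kept as-is.
    """
    result = []
    for text in texts:
        try:
            record = json.loads(text)
            result.append(str(record["question"]))
        except (ValueError, KeyError, TypeError):
            # Not JSON, no "question" key, or not a JSON object: embed the full text.
            result.append(text)
    return result


to_embed = questions_for_embedding([
    '{"question": "How do I reset my password?", "answer": "Use the settings page."}',
    "A plain paragraph from a normal document.",
])
```

Only the string passed to the embedding model changes here; the stored chunk would still need to keep the full question and answer so that the answer can be returned at retrieval time.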

Apr 23 '24 10:04 nauyiahc

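Hi, I made the change at the location your code indicates, but the print doesn't seem to be triggered. Are you sure QA mode goes through this method?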

Apr 29 '24 02:04 chuanSir123

Sorry, we will not accept this PR because it is not generally applicable enough.

Jun 15 '24 05:06 liunux4odoo
