Can we split the text by separators first?
Self Checks
- [X] I have searched for existing issues, including closed ones.
- [X] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [X] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
- [X] Please do not modify this template :) and fill in all the required fields.
1. Is this request related to a challenge you're experiencing? Tell me about your story.
I built a knowledge base AI Chatbot with DIFY:
- I prepared a well-formatted text document that uses "\r\n\r\n" as the separator.
- I loaded it into DIFY with a chunk size of 500. While testing, I found that some information was lost: the chatbot could not answer a question even when I entered exactly the same sentence that appears in the document.
After some digging and guessing, I found that one chunk contained 4 parts. I split it into 4 chunks manually, and the issue was fixed.
2. Additional context or comments
The answer quality depends heavily on how the text is split. Currently DIFY packs several small chunks into one big chunk, which hides a lot of useful information. Can we split the text by separators first? For example, with the following steps:
- Split the text by level 1 separators first, such as "------------".
- Check the chunk size, and stop if it is less than the max chunk size.
- Split the text by level 2 separators, such as "\r\n" or "\n".
- Check the chunk size, and stop if it is less than the max chunk size.
- Continue in the same way with lower-level separators, and split by characters or tokens only as a last resort.
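To make the proposal concrete, here is a minimal Python sketch of the steps above. It is not DIFY's implementation: the function name `split_by_separators`, its parameters, and the use of character count as the chunk size are all my own assumptions for illustration (a real splitter would probably count tokens).

```python
from typing import List, Sequence


def split_by_separators(text: str, separators: Sequence[str], max_size: int) -> List[str]:
    """Hypothetical helper: split by the first separator, then recurse
    with the remaining separators only into pieces that are still too big.

    Assumptions: separators are non-empty strings ordered from level 1
    downwards, and chunk size is measured in characters.
    """
    if not separators:
        # Last resort: cut into fixed-size character windows.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    first, rest = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in text.split(first):
        piece = piece.strip()
        if not piece:
            continue  # skip empty pieces between adjacent separators
        if len(piece) <= max_size:
            chunks.append(piece)  # small enough, stop splitting here
        else:
            # Still too large: try the next separator level.
            chunks.extend(split_by_separators(piece, rest, max_size))
    return chunks


if __name__ == "__main__":
    doc = "part one\n------------\nline a\nline b is long\n------------\nend"
    for chunk in split_by_separators(doc, ["------------", "\n"], max_size=10):
        print(repr(chunk))
```

The key point is that a separator always produces a chunk boundary, so several small parts are never merged into one chunk just because they fit under the size limit together.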
Furthermore, a lot of information is lost if we convert other documents into text files first. Can we split a document into chunks directly according to its source structure, such as the DOCX or PDF structure?
3. Can you help us with this feature?
- [ ] I am interested in contributing to this feature.
I have the same problem. No matter how many times I use a '---' separator, it still divides the segments with double line breaks. It takes a lot of effort to modify the documents and configuration to get an agent to respond correctly. Additionally, if I include a prompt for the bot to ensure it has all the information before responding, it doesn't work. It always takes what it receives from the knowledge base and responds immediately.
The splitter method of Dify is not correct. Generally, the separator is used on the red part, but Dify is used to split large documents.