Issue with chunk splitting using parent-child mode in a Markdown file
Self Checks
- [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
0.15.2
Cloud or Self Hosted
Cloud
Steps to reproduce
I am currently attempting to create a Knowledge Base using the parent-child mode with the following Markdown file.
The file contains 700 entries, each separated by a # and composed of a title, a URL, and corresponding text content. I configured the chunk splitting as follows, using these regular expressions:
Parent chunk identifier: `(^# .*(?:\n(?!# ).*)*)`
Child chunk identifier: (^## .+)|(\n{2,})|(^\s[-*+]\s)|(^|.+|)
The expected result was a Knowledge Base where the parent chunk consists of the "title and URL" and the child chunk consists of the "text content." However, the actual result included multiple URLs within a single chunk, which did not meet the expected structure.
I also tested simpler regular expressions, but the result remained the same. Below is a simplified example of the Markdown file (structure only):
Title1
URL1
TEXT1
Title2
URL2
TEXT2
...
Title700
URL700
TEXT700
I would like to request guidance on how to create a Knowledge Base using this Markdown file, where the parent chunk consists of the "title and URL" and the child chunk consists of the "text content." This can be achieved using the parent-child mode or any other feasible method.
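One way to pin down the intended structure is to build it outside Dify first. The following is a minimal sketch, assuming every entry has the form `# Title`, then one URL line, then free text (the simplified example above with the leading `#` present). The names `Entry` and `parse_entries` are hypothetical and are not part of Dify.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    parent: str  # title and URL
    child: str   # text content


def parse_entries(markdown_text: str) -> list:
    """Group lines into one Entry per top-level '# ' heading."""
    blocks = []
    for line in markdown_text.splitlines():
        if line.startswith("# "):
            blocks.append([line[2:].strip()])  # new entry, starting with its title
        elif blocks:
            blocks[-1].append(line)            # URL or text of the current entry
    entries = []
    for title, *rest in blocks:
        url = rest[0].strip() if rest else ""
        text = "\n".join(rest[1:]).strip()
        entries.append(Entry(parent=f"{title}\n{url}", child=text))
    return entries


sample = "# Title1\nURL1\nTEXT1\n# Title2\nURL2\nTEXT2\n"
for entry in parse_entries(sample):
    print(repr(entry.parent), "->", repr(entry.child))
```

Lines starting with `## ` are deliberately treated as text, so subheadings stay inside the child content.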
✔️ Expected Behavior
A Knowledge Base should be created from the Markdown file in which each parent chunk consists of the "title and URL" and each child chunk consists of the "text content."
❌ Actual Behavior
The Knowledge Base was created, but single chunks contained multiple URLs.
I found a similar unsolved discussion that might be relevant to your issue: "Help: Developing a Business Chatbot with Dify and Challenges in Optimizing Knowledge Base Chunking" [1].
To address your issue with chunk splitting in parent-child mode, you can try the following approach:
- Parent Chunk Configuration:
  - Use a regex pattern that accurately captures the title and URL as a single parent chunk. Your current regex `(^# .*(?:\n(?!# ).*)*)` might be too broad. Consider simplifying it to capture only the title and URL, such as: `^# (.+)\n(.+)`
  - This pattern assumes each entry starts with a `#` followed by the title and URL on separate lines.
- Child Chunk Configuration:
  - For the child chunks, ensure that the regex captures the text content correctly. You might want to simplify the child chunk identifier to focus on paragraph breaks or specific markers within the text content. For example: `\n{2,}`
  - This pattern splits the text content into chunks based on double newlines, which typically indicate paragraph breaks.
- Preview and Adjust:
  - Use Dify's "Preview Chunks" feature to test and adjust your regex patterns. This will help you verify if the chunks are being split as expected.
- Text Preprocessing:
  - Ensure that any preprocessing steps, such as removing unnecessary spaces or characters, are applied before chunking to maintain consistency.
By refining your regex patterns and using the preview feature, you should be able to achieve the desired structure for your Knowledge Base.
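The difference between the two parent patterns can be checked in isolation. The sketch below uses Python's `re` module with `re.MULTILINE`. Whether Dify applies the pattern with the same flags, or treats it as a delimiter rather than as a matcher, is an assumption that this thread does not confirm.

```python
import re

sample = "# Title1\nURL1\nTEXT1\n# Title2\nURL2\nTEXT2"

# The original pattern: a heading plus every following line up to the next '# '.
broad = re.compile(r"(^# .*(?:\n(?!# ).*)*)", re.MULTILINE)
# The simplified pattern: the heading line and exactly one following line.
narrow = re.compile(r"^# (.+)\n(.+)", re.MULTILINE)

broad_matches = [m.group(0) for m in broad.finditer(sample)]
narrow_matches = [m.groups() for m in narrow.finditer(sample)]

print(broad_matches)   # whole entries, text included
print(narrow_matches)  # (title, URL) pairs only
```

Under these assumptions the broad pattern selects each whole entry, while the narrow one selects only the title and URL lines.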
To continue talking to Dosu, mention @dosu.
I have tried creating the knowledge base based on your suggestions. However, the first title chunk includes the next title and URL as below, which prevents the expected structure from being achieved. Could you please provide additional solutions to resolve this issue?
Preview Result:
Chunk1 · 419 characters
  c-1 (Title1) URL1 Text1
  c-2 (Title2) URL2 Text2-1
Chunk2 · 701 characters
  c-1 Text2-2 .....
......
To resolve the issue where the first title chunk includes the next title and URL, you can try adjusting the regex patterns to ensure that each title and URL is captured as a separate parent chunk. Here's a refined approach:
- Parent Chunk Configuration:
  - Modify the regex pattern to ensure it captures only the title and URL for each entry. You can use a pattern that stops at the next title marker: `^# (.+)\n(.+?)(?=\n#|\Z)`
  - This pattern captures the title and URL, stopping at the next `#` or the end of the file (`\Z`).
- Child Chunk Configuration:
  - For the child chunks, continue using a pattern that captures the text content based on paragraph breaks or specific markers. You can use: `\n{2,}`
  - This pattern splits the text content into chunks based on double newlines.
- Testing and Adjustments:
  - Use Dify's "Preview Chunks" feature to test these patterns and ensure they are capturing the desired structure. Adjust the patterns as needed based on the preview results.
By refining the parent chunk regex to stop at the next title marker, you should be able to achieve the expected structure for your Knowledge Base.
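It may help to try this pattern on a three-line entry (title, URL, text) before relying on it. The following is a minimal sketch using Python's `re` with `re.MULTILINE`, which is an assumption, since the thread does not say which engine or flags Dify applies. Because `.` does not cross newlines by default, the lazy group must end on the URL line, and the lookahead then requires the very next line to start with `#`.

```python
import re

pattern = re.compile(r"^# (.+)\n(.+?)(?=\n#|\Z)", re.MULTILINE)

# Entries with a text line after the URL, as in the reported file.
three_line = "# Title1\nURL1\nTEXT1\n# Title2\nURL2\nTEXT2"
# Entries that consist of a title and a URL only.
two_line = "# Title1\nURL1\n# Title2\nURL2"

three_line_matches = pattern.findall(three_line)
two_line_matches = pattern.findall(two_line)

print(three_line_matches)
print(two_line_matches)
```

In this sketch the pattern matches the two-line entries but finds nothing in the three-line ones, which is one possible reason the suggestion does not produce the expected parent chunks.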
@Dify team
Thank you for your updated advice. I tried your regex patterns, but they did not work. Could you please give more concrete advice?
@dosu
Thank you for your updated advice. I tried your regex patterns, but they failed and I got the same result as before. Could this be a bug?
The issue you're experiencing might be related to how the regex patterns are applied or how the text is preprocessed before chunking. Here are a few additional steps you can take to troubleshoot and refine your approach:
- Ensure Correct Preprocessing:
  - Verify that any preprocessing steps, such as removing extra spaces or normalizing line endings, are consistent with your regex expectations. This can affect how patterns match the text.
- Refine Parent Chunk Regex:
  - If the previous regex didn't work as expected, try this alternative pattern for parent chunks: `^# (.+)\n(.*?)(?=\n#|\Z)`
  - This pattern captures the title and URL, ensuring it stops at the next `#` or the end of the file.
- Child Chunk Regex:
  - For child chunks, ensure the pattern captures text content correctly. You might need to adjust based on the specific structure of your text: `(?<=\n\n)(.*?)(?=\n\n|$)`
  - This pattern captures text between double newlines, which typically indicate paragraph breaks.
- Debugging with Smaller Samples:
  - Test your regex patterns on smaller sections of your Markdown file to isolate issues. This can help identify where the pattern might be failing.
- Check for Overlapping Patterns:
  - Ensure that your regex patterns do not unintentionally overlap, which can cause unexpected results.
If these adjustments still don't resolve the issue, it might be worth checking if there are any known bugs related to regex processing in the version of Dify you are using. However, based on the information available, there are no documented bugs specifically related to regex-based chunk splitting in Dify [1][2].
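The preprocessing and small-sample points can be combined into a quick local check. The sketch below is written under the assumption that the file might have been saved with Windows (CRLF) line endings; the thread does not state whether Dify normalizes them. The helper names `normalize_newlines` and `split_paragraphs` are hypothetical.

```python
import re


def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def split_paragraphs(text: str) -> list:
    """Split on runs of two or more LF characters, dropping empty pieces."""
    return [part for part in re.split(r"\n{2,}", text) if part.strip()]


crlf_sample = "TEXT1 paragraph A\r\n\r\nTEXT1 paragraph B"

raw_parts = split_paragraphs(crlf_sample)
clean_parts = split_paragraphs(normalize_newlines(crlf_sample))

print(len(raw_parts))    # the CR characters keep the two LFs apart
print(len(clean_parts))  # splits at the blank line after normalization
```

If the raw and normalized counts differ for a small excerpt of the real file, line endings are a likely reason that a `\n{2,}` pattern does not split as expected.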
@dosu
I tried creating a knowledge base from a very simple .md file, but the chunks still could not be separated at each title.
To separate individual chunks by each title in a simple Markdown file, you can use the Markdown headers as delimiters. Here's a Python example using the markdown library and BeautifulSoup to parse and separate chunks by titles:
import markdown
from bs4 import BeautifulSoup

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

def separate_chunks_by_title(markdown_text):
    # Convert Markdown to HTML
    html = markdown.markdown(markdown_text)
    # Parse HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    chunks = []
    current_chunk = []
    # Iterate over top-level tags only, skipping whitespace text nodes
    for element in soup.find_all(recursive=False):
        if element.name in HEADER_TAGS:
            # If we encounter a header, start a new chunk
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
                current_chunk = []
        current_chunk.append(str(element))
    # Add the last chunk if it exists
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    return chunks

# Example usage
markdown_text = """
# Title 1
Content under title 1.
## Subtitle 1.1
Content under subtitle 1.1.
# Title 2
Content under title 2.
"""
chunks = separate_chunks_by_title(markdown_text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")
This code will separate the Markdown content into chunks based on the headers, allowing you to manage each section individually. If you are using Dify, ensure that your chunking rules are correctly configured to match the structure of your Markdown file [1][2].
@Dify team
I tried creating a knowledge base from a very simple .md file using the parent-child mode, but the chunks still could not be separated at each title.
Parent Chunk Regex: `^# (.+)\n(.*?)(?=\n#|\Z)`
Child Chunk Regex: `(?<=\n\n)(.*?)(?=\n\n|$)`
FYI @DiFy is not our member or organization name :)
We will take a look at this issue later.
@crazywoola
You have helped me before, and I really appreciate it.
I apologize if I am not following the proper etiquette.
There’s no rush for a response—please take your time.
Currently, I am trying to work around the issue by manually chunking the data in Excel before importing it into Dify.
@crazywoola
Subject: Chunk Splitting Behavior Changed Unexpectedly
Hi,
I've noticed an unexpected change in the way chunks are being split when uploading the same file.
One week ago, the chunking behavior was inconsistent regarding #, sometimes reacting to it and sometimes not. However, now it still behaves inconsistently with #, but in a different manner.
Previously, I was able to separate parent chunks using ======, but this no longer works as expected.
I checked the release notes but couldn't find any relevant updates. If there have been any changes affecting chunking behavior, could you provide any insights?
Thanks!
@numeyume I faced the same problem and found a workaround. By inserting custom delimiter markers (===) at the intended split locations and saving as a .txt file, I managed to resolve the issue. Using regular expressions does not seem to be working as expected.
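The workaround described above can be scripted rather than done by hand. The following is a minimal sketch, assuming each intended split location is a line starting with `# ` and that `===` does not otherwise occur in the file. The function names and file paths are hypothetical, and the sketch only prepares the .txt file; it does not call Dify.

```python
from pathlib import Path

DELIMITER = "==="  # marker reported in this thread


def add_delimiters(markdown_text: str, delimiter: str = DELIMITER) -> str:
    """Insert a delimiter line before every top-level '# ' heading except the first."""
    out = []
    for line in markdown_text.splitlines():
        if line.startswith("# ") and out:
            out.append(delimiter)
        out.append(line)
    return "\n".join(out) + "\n"


def convert(md_path: str, txt_path: str) -> None:
    """Read a Markdown file and write the delimited copy as plain text."""
    text = Path(md_path).read_text(encoding="utf-8")
    Path(txt_path).write_text(add_delimiters(text), encoding="utf-8")


print(add_delimiters("# Title1\nURL1\nTEXT1\n# Title2\nURL2\nTEXT2\n"))
```

The resulting .txt file would then be uploaded with `===` entered as the parent chunk delimiter, as the comment above reports.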
+1
I encountered the same problem; it seems that regex in Dify might not work as expected.
😅
I converted the Markdown file to .txt and the regular expression took effect as expected, but this is not a standard solution.
It seems that the md file has a special splitter