
Issue with chunk splitting using parent-child mode in a Markdown file

Open numeyume opened this issue 10 months ago • 19 comments

Self Checks

  • [x] This is only for bug report, if you would like to ask a question, please head to Discussions.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • [x] Please do not modify this template :) and fill in all the required fields.

Dify version

0.15.2

Cloud or Self Hosted

Cloud

Steps to reproduce

I am currently attempting to create a Knowledge Base using the parent-child mode with the following Markdown file.

The file contains 700 entries, each separated by a # and composed of a title, a URL, and corresponding text content. I configured the chunk splitting as follows, using these regular expressions:

Parent chunk identifier: (^# .*(?:\n(?!# ).*)*)

Child chunk identifier: (^## .+)|(\n{2,})|(^\s[-*+]\s)|(^|.+|)

The expected result was a Knowledge Base where the parent chunk consists of the "title and URL" and the child chunk consists of the "text content." However, the actual result included multiple URLs within a single chunk, which did not meet the expected structure.

I also tested simpler regular expressions, but the result remained the same. Below is a simplified example of the Markdown file (structure only):


# Title1

URL1
TEXT1

# Title2

URL2
TEXT2

...

# Title700

URL700
TEXT700

I would like to request guidance on how to create a Knowledge Base using this Markdown file, where the parent chunk consists of the "title and URL" and the child chunk consists of the "text content." This can be achieved using the parent-child mode or any other feasible method.
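For reference, the requested structure can be sketched outside Dify in plain Python. This is only an illustration of the intended parent/child split, assuming each entry begins with a `# ` heading line followed by a URL line and then the text; the function name `split_entries` and the dictionary keys are hypothetical and not part of Dify.

```python
import re


def split_entries(markdown_text):
    """Split a Markdown document into entries that each start with a '# ' heading.

    Returns a list of dicts: 'parent' holds "title\\nURL", 'children' holds
    the text content split into paragraphs. (Names are illustrative only.)
    """
    # Split right before every line that begins with '# ' (top-level headings only;
    # '## ' lines do not match because the second character must be a space).
    parts = re.split(r"(?m)^(?=# )", markdown_text)
    entries = []
    for part in parts:
        block = part.strip()
        if not block.startswith("# "):
            continue  # skip any preamble before the first heading
        title_line, _, rest = block.partition("\n")
        # The URL is assumed to be the first non-blank line after the title.
        url, _, body = rest.strip().partition("\n")
        # Child chunks: paragraphs separated by one or more blank lines.
        children = [p.strip() for p in re.split(r"\n{2,}", body) if p.strip()]
        entries.append({
            "parent": f"{title_line[2:].strip()}\n{url.strip()}",
            "children": children,
        })
    return entries


if __name__ == "__main__":
    sample = "# Title1\nURL1\nTEXT1\n\n# Title2\nURL2\nTEXT2\n"
    for entry in split_entries(sample):
        print(entry)
```

A pre-split result like this could then be exported in whatever form is convenient for import, which is a separate step not covered here.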

✔️ Expected Behavior

A Knowledge Base should be created in which each parent chunk consists of the "title and URL" and each child chunk consists of the "text content."

❌ Actual Behavior

The Knowledge Base that was created included multiple URLs within a single chunk.

numeyume avatar Feb 01 '25 06:02 numeyume

I found a similar unsolved discussion that might be relevant to your issue: Help: Developing a Business Chatbot with Dify and Challenges in Optimizing Knowledge Base Chunking [1].

To address your issue with chunk splitting in parent-child mode, you can try the following approach:

  1. Parent Chunk Configuration:

    • Use a regex pattern that accurately captures the title and URL as a single parent chunk. Your current regex (^# .*(?:\n(?!# ).*)*) might be too broad. Consider simplifying it to capture only the title and URL, such as:
      ^# (.+)\n(.+)
      
    • This pattern assumes each entry starts with a # followed by the title and URL on separate lines.
  2. Child Chunk Configuration:

    • For the child chunks, ensure that the regex captures the text content correctly. You might want to simplify the child chunk identifier to focus on paragraph breaks or specific markers within the text content. For example:
      \n{2,}
      
    • This pattern splits the text content into chunks based on double newlines, which typically indicate paragraph breaks.
  3. Preview and Adjust:

    • Use Dify's "Preview Chunks" feature to test and adjust your regex patterns. This will help you verify if the chunks are being split as expected.
  4. Text Preprocessing:

    • Ensure that any preprocessing steps, such as removing unnecessary spaces or characters, are applied before chunking to maintain consistency.

By refining your regex patterns and using the preview feature, you should be able to achieve the desired structure for your Knowledge Base.
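As a sanity check outside Dify, the suggested parent pattern can be run against a small sample with Python's `re` module. This is a minimal sketch on made-up sample text; it only shows what the pattern matches as a regular expression and says nothing about how Dify itself applies a chunk identifier.

```python
import re

# Made-up sample mirroring the structure described in the issue.
sample = "# Title1\nURL1\nTEXT1\n\n# Title2\nURL2\nTEXT2\n"

# re.MULTILINE lets '^' match at the start of every line, not just the string.
parent_pattern = re.compile(r"^# (.+)\n(.+)", re.MULTILINE)

# Each match yields (title, line after the title).
pairs = parent_pattern.findall(sample)
for title, url in pairs:
    print(f"title={title!r} url={url!r}")
```

If the pattern behaves as intended here but the preview still merges entries, the difference lies in how the pattern is applied during chunking rather than in the pattern itself.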

To continue talking to Dosu, mention @dosu.



dosubot[bot] avatar Feb 01 '25 06:02 dosubot[bot]

I have tried creating the knowledge base based on your suggestions. However, the first title chunk includes the next title and URL as below, which prevents the expected structure from being achieved. Could you please provide additional solutions to resolve this issue?

Preview Result:

Chunk1 · 419 characters
c-1 (Title1) URL1 Text1
c-2 (Title2) URL2 Text2-1

Chunk2 · 701 characters
c-1 Text2-2 .....

......

numeyume avatar Feb 01 '25 07:02 numeyume

To resolve the issue where the first title chunk includes the next title and URL, you can try adjusting the regex patterns to ensure that each title and URL is captured as a separate parent chunk. Here's a refined approach:

  1. Parent Chunk Configuration:

    • Modify the regex pattern to ensure it captures only the title and URL for each entry. You can use a pattern that stops at the next title marker:
      ^# (.+)\n(.+?)(?=\n#|\Z)
      
    • This pattern captures the title and URL, stopping at the next # or the end of the file (\Z).
  2. Child Chunk Configuration:

    • For the child chunks, continue using a pattern that captures the text content based on paragraph breaks or specific markers. You can use:
      \n{2,}
      
    • This pattern splits the text content into chunks based on double newlines.
  3. Testing and Adjustments:

    • Use Dify's "Preview Chunks" feature to test these patterns and ensure they are capturing the desired structure. Adjust the patterns as needed based on the preview results.

By refining the parent chunk regex to stop at the next title marker, you should be able to achieve the expected structure for your Knowledge Base.
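Before relying on the lookahead pattern, it may help to check it with Python's `re` module on a small sample. The sketch below uses made-up sample text and assumes `re.MULTILINE` without `re.DOTALL`; under those assumptions `.+?` cannot cross a line break, so the pattern only matches entries where the URL line is immediately followed by the next `#` or the end of the text.

```python
import re

pattern = re.compile(r"^# (.+)\n(.+?)(?=\n#|\Z)", re.MULTILINE)

# Entries with a text line after the URL, as in the issue's file.
with_text = "# Title1\nURL1\nTEXT1\n\n# Title2\nURL2\nTEXT2\n"

# Entries that contain only a title and a URL.
title_and_url_only = "# Title1\nURL1\n# Title2\nURL2"

matches_with_text = pattern.findall(with_text)
matches_short = pattern.findall(title_and_url_only)

print("with text:", matches_with_text)
print("title and URL only:", matches_short)
```

In this sketch the first sample produces no matches at all, because the lookahead is never satisfied once a text line follows the URL.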


dosubot[bot] avatar Feb 01 '25 07:02 dosubot[bot]

@Dify team

Thank you for your updated advice. I tried your regex patterns, but they did not work. Could you please give me other concrete advice?

numeyume avatar Feb 01 '25 07:02 numeyume

@dosu

Thank you for your updated advice. I tried your regex patterns, but they failed: I got the same result as before. Could this be a bug?

numeyume avatar Feb 01 '25 07:02 numeyume

The issue you're experiencing might be related to how the regex patterns are applied or how the text is preprocessed before chunking. Here are a few additional steps you can take to troubleshoot and refine your approach:

  1. Ensure Correct Preprocessing:

    • Verify that any preprocessing steps, such as removing extra spaces or normalizing line endings, are consistent with your regex expectations. This can affect how patterns match the text.
  2. Refine Parent Chunk Regex:

    • If the previous regex didn't work as expected, try this alternative pattern for parent chunks:
      ^# (.+)\n(.*?)(?=\n#|\Z)
      
    • This pattern captures the title and URL, ensuring it stops at the next # or the end of the file.
  3. Child Chunk Regex:

    • For child chunks, ensure the pattern captures text content correctly. You might need to adjust based on the specific structure of your text:
      (?<=\n\n)(.*?)(?=\n\n|$)
      
    • This pattern captures text between double newlines, which typically indicate paragraph breaks.
  4. Debugging with Smaller Samples:

    • Test your regex patterns on smaller sections of your Markdown file to isolate issues. This can help identify where the pattern might be failing.
  5. Check for Overlapping Patterns:

    • Ensure that your regex patterns do not unintentionally overlap, which can cause unexpected results.

If these adjustments still don't resolve the issue, it might be worth checking if there are any known bugs related to regex processing in the version of Dify you are using. However, based on the information available, there are no documented bugs specifically related to regex-based chunk splitting in Dify [1][2].
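The preprocessing point about line endings can be illustrated with a short Python sketch. It assumes the file may contain Windows (`\r\n`) or old Mac (`\r`) line endings, in which case a pattern such as `\n{2,}` will not see two consecutive `\n` characters. The helper names `normalize_newlines` and `split_paragraphs` are hypothetical.

```python
import re


def normalize_newlines(text):
    """Convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def split_paragraphs(text):
    """Split text on blank lines after normalizing line endings."""
    normalized = normalize_newlines(text)
    return [p for p in re.split(r"\n{2,}", normalized) if p.strip()]


if __name__ == "__main__":
    windows_text = "first paragraph\r\n\r\nsecond paragraph"
    # Without normalization, '\r\n\r\n' contains no two adjacent '\n' characters.
    print(re.split(r"\n{2,}", windows_text))
    print(split_paragraphs(windows_text))
```

Running a file through a normalization step like this before upload removes one possible source of mismatch between the pattern and the text.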


dosubot[bot] avatar Feb 01 '25 07:02 dosubot[bot]

@dosu

I tested creating a Knowledge Base from a very simple .md file, but I still could not get a separate chunk for each title.

test2.md

numeyume avatar Feb 01 '25 08:02 numeyume

Image

numeyume avatar Feb 01 '25 08:02 numeyume

To separate individual chunks by each title in a simple Markdown file, you can use the Markdown headers as delimiters. Here's a Python example using the markdown library and BeautifulSoup to parse and separate chunks by titles:

import markdown
from bs4 import BeautifulSoup

def separate_chunks_by_title(markdown_text):
    # Convert Markdown to HTML
    html = markdown.markdown(markdown_text)
    # Parse HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    
    chunks = []
    current_chunk = []
    
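    for element in soup:
        # Skip whitespace-only text nodes between block elements
        if not element.name and not str(element).strip():
            continue
        # Only real headings (h1-h6) start a new chunk, so tags like 'hr' do not match
        if element.name in {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}:
            if current_chunk:
                chunks.append('\n'.join(current_chunk))
                current_chunk = []
        current_chunk.append(str(element))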
    
    # Add the last chunk if it exists
    if current_chunk:
        chunks.append('\n'.join(current_chunk))
    
    return chunks

# Example usage
markdown_text = """
# Title 1
Content under title 1.

## Subtitle 1.1
Content under subtitle 1.1.

# Title 2
Content under title 2.
"""

chunks = separate_chunks_by_title(markdown_text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")

This code will separate the Markdown content into chunks based on the headers, allowing you to manage each section individually. If you are using Dify, ensure that your chunking rules are correctly configured to match the structure of your Markdown file [1][2].


dosubot[bot] avatar Feb 01 '25 08:02 dosubot[bot]

@Dify team

I tested creating a Knowledge Base from a very simple .md file, but I could not get a separate chunk for each title using the parent-child mode.

Parent Chunk Regex: ^# (.+)\n(.*?)(?=\n#|\Z)

Child Chunk Regex: (?<=\n\n)(.*?)(?=\n\n|$)

test2.md

numeyume avatar Feb 01 '25 08:02 numeyume

FYI @DiFy is not our member or organization name :)

We will take a look at this issue later.

crazywoola avatar Feb 02 '25 07:02 crazywoola

@crazywoola

You have helped me before, and I really appreciate it.
I apologize if I am not following the proper etiquette.
There’s no rush for a response—please take your time.

Currently, I am trying to work around the issue by manually chunking the data in Excel before importing it into Dify.

numeyume avatar Feb 02 '25 09:02 numeyume

@crazywoola

Subject: Chunk Splitting Behavior Changed Unexpectedly

Hi,

I've noticed an unexpected change in the way chunks are being split when uploading the same file.

One week ago, the chunking behavior was inconsistent regarding #, sometimes reacting to it and sometimes not. However, now it still behaves inconsistently with #, but in a different manner.

Previously, I was able to separate parent chunks using ======, but this no longer works as expected.

I checked the release notes but couldn't find any relevant updates. If there have been any changes affecting chunking behavior, could you provide any insights?

Thanks!

numeyume avatar Feb 15 '25 10:02 numeyume

@numeyume I faced the same problem and found a workaround. By inserting custom delimiter markers (===) at the intended split locations and saving as a .txt file, I managed to resolve the issue. Using regular expressions does not seem to be working as expected.
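The workaround above can be scripted rather than done by hand. The following is a minimal Python sketch, assuming every intended split location is a line starting with `# ` and that `===` does not otherwise occur in the file; the function names and file paths are hypothetical.

```python
import re
from pathlib import Path

DELIMITER = "==="  # custom marker from the workaround described above


def add_delimiters(markdown_text, delimiter=DELIMITER):
    """Insert a delimiter line before every '# ' heading except the first."""
    body = markdown_text.strip()
    # Match the newline that directly precedes a top-level heading line.
    return re.sub(r"\n(?=# )", f"\n{delimiter}\n", body)


def convert(md_path, txt_path):
    """Read a Markdown file, add delimiters, and save the result as .txt."""
    text = Path(md_path).read_text(encoding="utf-8")
    Path(txt_path).write_text(add_delimiters(text), encoding="utf-8")


if __name__ == "__main__":
    sample = "# Title1\nURL1\nTEXT1\n\n# Title2\nURL2\nTEXT2\n"
    print(add_delimiters(sample))
```

The resulting .txt file can then be uploaded with `===` entered as the parent chunk delimiter, as described in the comment above.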

xieshaocong-ethan avatar Feb 15 '25 17:02 xieshaocong-ethan

+1

rowstop avatar Feb 28 '25 07:02 rowstop

I encountered the same problem and it seemed that regex in dify might not work as expected.

Scofield-Gu avatar Mar 02 '25 02:03 Scofield-Gu

😅

rowstop avatar Mar 04 '25 06:03 rowstop

I converted the Markdown file to .txt and the regular expression took effect as expected, but this is not a standard solution.

hoojunev avatar Mar 13 '25 01:03 hoojunev

> I converted the Markdown file to .txt and the regular expression took effect as expected, but this is not a standard solution.

It seems that the md file has a special splitter

rowstop avatar Mar 17 '25 01:03 rowstop