
Markdown Heading Merging Fails

Open jjk334 opened this issue 9 months ago • 2 comments

Self Checks

  • [x] This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • [x] Please do not modify this template :) and fill in all the required fields.

Dify version

1.0.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

I followed this issue (https://github.com/langgenius/dify/issues/15043#issue-2899027141), modified the code, and restarted the corresponding container services. However, splitting on either "#" or "##" still fails to merge the following Markdown content into a single block:

1.123123

1.1 123123

98403282e6c7 langgenius/dify-api:1.0.0 "/bin/bash /entrypoi…" 4 days ago Up 18 hours 5001/tcp docker-worker-1
198bf9ec31c1 langgenius/dify-api:1.0.0 "/bin/bash /entrypoi…" 4 days ago Up 18 hours 5001/tcp


✔️ Expected Behavior

No response

❌ Actual Behavior

No response

jjk334 • Mar 11 '25 01:03

Hey @jjk334! 👋 I'm here to help you with bug fixes, questions, and contributions. Let's tackle this issue together!

The issue you're experiencing with markdown heading merging in Dify version 1.0.0 seems to be related to how the text_splitter.py file processes markdown headers. The logic for merging headers is based on the header_stack and current_header_level variables, which manage the nested structure of headers. When a new header is encountered, headers of the same or higher level are removed from the stack, which should allow for merging [1].
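To make the stack behaviour concrete, here is a minimal, self-contained sketch of header-stack tracking. It is not Dify's actual text_splitter.py code; the function name `split_by_headers`, the metadata keys `h1`/`h2`, and the sample text are assumptions made for illustration, including the guess that the two lines in the report were a level-1 and a level-2 heading.

```python
def split_by_headers(text, headers_to_split_on):
    """Illustrative only: split text on header markers, tracking nesting with a stack."""
    # Test longer markers first so that "##" is tried before "#".
    markers = sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)
    header_stack = []   # entries are (level, metadata_key, title)
    sections = []
    current_lines = []
    current_meta = {}

    def flush():
        # Close the section collected so far, if it has any content.
        if current_lines:
            sections.append(
                {"content": "\n".join(current_lines), "metadata": dict(current_meta)}
            )
            current_lines.clear()

    for raw in text.split("\n"):
        line = raw.strip()
        matched = False
        for marker, key in markers:
            is_header = line.startswith(marker) and (
                len(line) == len(marker) or line[len(marker)] == " "
            )
            if is_header:
                flush()
                level = marker.count("#")
                # Drop headers of the same or deeper level before pushing the new one.
                while header_stack and header_stack[-1][0] >= level:
                    header_stack.pop()
                header_stack.append((level, key, line[len(marker):].strip()))
                current_meta = {k: title for _, k, title in header_stack}
                matched = True
                break
        if not matched and line:
            current_lines.append(line)
    flush()
    return sections


# Hypothetical sample input; the heading levels are an assumption.
sample = "# 1.123123\nintro\n## 1.1 123123\nbody\n# 2\nnext"
sections = split_by_headers(sample, [("#", "h1"), ("##", "h2")])
for section in sections:
    print(section)
```

In this sketch every configured header starts a new section, so listing both "#" and "##" as split markers produces more blocks, not fewer. If the goal is one block per top-level heading, the "##" lines would have to be left as ordinary content.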

If you've already modified the code and restarted the Docker container without success, consider checking the following:

  1. Ensure Correct Header Levels: Verify that the headers_to_split_on configuration includes both "#" and "##" and that they are correctly prioritized in the code. The headers should be sorted by length in descending order to ensure proper processing.

  2. Review Code Changes: Double-check any modifications made to the text_splitter.py file to ensure they align with the intended logic for header merging. Ensure that the aggregate_lines_to_chunks method is correctly aggregating lines with common metadata.

  3. Check for Overlapping Configurations: Ensure that the text segmentation settings, such as maximum segment length and overlap, are not interfering with the merging process.
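The first two checks can be sketched in a few lines. This is a simplified illustration, not the code shipped in Dify: `sort_markers` and `aggregate_by_metadata` are hypothetical helper names, and the dictionaries stand in for whatever line objects the real splitter uses.

```python
def sort_markers(headers_to_split_on):
    """Order markers longest first so "##" is matched before "#"."""
    return sorted(headers_to_split_on, key=lambda pair: len(pair[0]), reverse=True)


def aggregate_by_metadata(lines):
    """Merge consecutive lines whose metadata is identical into one chunk."""
    chunks = []
    for line in lines:
        if chunks and chunks[-1]["metadata"] == line["metadata"]:
            # Same headers as the previous chunk: append instead of starting a new one.
            chunks[-1]["content"] += "\n" + line["content"]
        else:
            chunks.append(
                {"content": line["content"], "metadata": dict(line["metadata"])}
            )
    return chunks


# Hypothetical input: three lines, the first two under the same header.
ordered = sort_markers([("#", "h1"), ("##", "h2")])
lines = [
    {"content": "first", "metadata": {"h1": "A"}},
    {"content": "second", "metadata": {"h1": "A"}},
    {"content": "third", "metadata": {"h1": "A", "h2": "A.1"}},
]
chunks = aggregate_by_metadata(lines)
print(ordered)
print(chunks)
```

Under these assumptions, lines only merge when their metadata matches exactly, so a sub-heading that is registered as a split marker changes the metadata and forces a new chunk.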

If these steps do not resolve the issue, you might need to further investigate the specific logic in the text_splitter.py file or consider additional debugging to identify any overlooked factors affecting the merging behavior.


dosubot[bot] • Mar 11 '25 01:03

Can you help me find a solution? I'm not very familiar with coding.

jjk334 • Mar 11 '25 09:03

Bro, did you fix the source code and recompile it yourself?

hoojunev • Mar 13 '25 09:03