Markdown Heading Merging Fails
Self Checks
- [x] This is only for bug reports; if you would like to ask a question, please head to Discussions.
- [x] I have searched for existing issues, including closed ones.
- [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
- [x] [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
- [x] Please do not modify this template :) and fill in all the required fields.
Dify version
1.0.0
Cloud or Self Hosted
Self Hosted (Docker)
Steps to reproduce
I followed this issue (https://github.com/langgenius/dify/issues/15043#issue-2899027141), modified the code accordingly, and restarted the corresponding container service. However, whether I split on "#" or "##", the following Markdown content still fails to merge into a single block:
1.123123
1.1 123123
98403282e6c7 langgenius/dify-api:1.0.0 "/bin/bash /entrypoi…" 4 days ago Up 18 hours 5001/tcp docker-worker-1
198bf9ec31c1 langgenius/dify-api:1.0.0 "/bin/bash /entrypoi…" 4 days ago Up 18 hours 5001/tcp
✔️ Expected Behavior
No response
❌ Actual Behavior
No response
Hey @jjk334! 👋 I'm here to help you with bug fixes, questions, and contributions. Let's tackle this issue together!
The issue you're experiencing with markdown heading merging in Dify version 1.0.0 seems to be related to how the text_splitter.py file processes markdown headers. The logic for merging headers is based on the header_stack and current_header_level variables, which manage the nested structure of headers. When a new header is encountered, headers of the same or higher level are removed from the stack, which should allow for merging [1].
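The stack-based handling described above can be illustrated with a minimal sketch. This is not Dify's actual text_splitter.py; the function name `split_by_headers`, the `HEADERS_TO_SPLIT_ON` constant, and the metadata keys `h1`/`h2` are hypothetical, chosen only to show how popping same-level or deeper headers keeps the stack consistent.

```python
from typing import Dict, List, Tuple

# Hypothetical configuration: (markdown prefix, metadata key) pairs.
HEADERS_TO_SPLIT_ON: List[Tuple[str, str]] = [("#", "h1"), ("##", "h2")]


def split_by_headers(text: str) -> List[Dict]:
    """Group body lines into chunks tagged with the active header path."""
    # Check longer prefixes first so "##" is never mistaken for "#".
    headers = sorted(HEADERS_TO_SPLIT_ON, key=lambda h: len(h[0]), reverse=True)
    header_stack: List[Tuple[int, str, str]] = []  # (level, key, title)
    chunks: List[Dict] = []
    current_lines: List[str] = []

    def flush() -> None:
        # Emit the accumulated body lines under the headers currently on the stack.
        content = "\n".join(current_lines).strip()
        if content:
            metadata = {key: title for _, key, title in header_stack}
            chunks.append({"content": content, "metadata": metadata})
        current_lines.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for prefix, key in headers:
            if stripped.startswith(prefix + " "):
                flush()  # close the chunk that belonged to the previous header
                level = len(prefix)
                # Remove headers of the same or deeper level before pushing.
                while header_stack and header_stack[-1][0] >= level:
                    header_stack.pop()
                header_stack.append((level, key, stripped[len(prefix):].strip()))
                break
        else:
            # Not a header line: it belongs to the current chunk.
            current_lines.append(line)
    flush()
    return chunks
```

Calling `split_by_headers("# A\nintro\n## B\nbody\n# C\nlast")` yields three chunks; the second carries both `h1` and `h2` metadata, and the third carries only the new `h1`, because the earlier headers were popped from the stack.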
If you've already modified the code and restarted the Docker container without success, consider checking the following:
- Ensure Correct Header Levels: Verify that the headers_to_split_on configuration includes both "#" and "##" and that they are correctly prioritized in the code. The headers should be sorted by length in descending order to ensure proper processing.
- Review Code Changes: Double-check any modifications made to the text_splitter.py file to ensure they align with the intended logic for header merging. Ensure that the aggregate_lines_to_chunks method is correctly aggregating lines with common metadata.
- Check for Overlapping Configurations: Ensure that the text segmentation settings, such as maximum segment length and overlap, are not interfering with the merging process.
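As a rough illustration of what "aggregating lines with common metadata" means, here is a minimal sketch. It is an assumption about the general technique, not the real aggregate_lines_to_chunks method; the function name `merge_lines_with_common_metadata` and the `content`/`metadata` dictionary layout are hypothetical.

```python
from typing import Dict, List


def merge_lines_with_common_metadata(lines: List[Dict]) -> List[Dict]:
    """Merge consecutive entries whose metadata dictionaries are identical."""
    merged: List[Dict] = []
    for line in lines:
        if merged and merged[-1]["metadata"] == line["metadata"]:
            # Same header path as the previous entry: extend that chunk.
            merged[-1]["content"] += "\n" + line["content"]
        else:
            # Header path changed: start a new chunk (copy to avoid aliasing).
            merged.append(
                {"content": line["content"], "metadata": dict(line["metadata"])}
            )
    return merged
```

If two consecutive lines end up in separate blocks even though they sit under the same header, comparing their metadata in this way is one place to look: any difference in the dictionaries prevents the merge.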
If these steps do not resolve the issue, you might need to further investigate the specific logic in the text_splitter.py file or consider additional debugging to identify any overlooked factors affecting the merging behavior.
Can you help me find a solution? I'm not very familiar with coding.
Bro, did you fix the source code yourself and recompile it?