Chunking ignored all texts before hash sign in markdown documents

Open tjroamer opened this issue 8 months ago • 1 comments

Self Checks

  • [x] This is only for bug reports; if you would like to ask a question, please head to Discussions.
  • [x] I have searched for existing issues, including closed ones.
  • [x] I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [x] [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • [x] Please do not modify this template :) and fill in all the required fields.

Dify version

1.4.1

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

  1. Create Knowledge
  2. Upload a markdown file as attached, click Next
  3. Leave all settings as default, click the button Preview Chunk

buggy_md_doc.md

✔️ Expected Behavior

The text before the first hash sign (#) is treated as a chunk.

❌ Actual Behavior

All text before the hash sign is ignored, and no chunk is created.
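To make the expected behavior concrete, here is a minimal sketch of a header-based splitter that keeps pre-header text as its own chunk. It is not Dify's code: `split_by_headers`, the `content`/`metadata` keys, and the `Section` field name are hypothetical and used only for illustration. The regex is also simplified and would treat a `#` line inside a fenced code block as a header.

```python
import re

# Simplified ATX header pattern: 1-6 hash signs, whitespace, then the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split markdown at header lines (hypothetical helper, not Dify's API)."""
    chunks: list[dict] = []
    current_lines: list[str] = []
    current_meta: dict = {}  # stays empty until the first header is seen

    def flush() -> None:
        # Emit the accumulated text as a chunk, skipping whitespace-only text.
        text = "\n".join(current_lines).strip()
        if text:
            chunks.append({"content": text, "metadata": dict(current_meta)})

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the chunk that precedes this header
            current_lines.clear()
            current_meta = {"Section": match.group(2).strip()}
        else:
            current_lines.append(line)
    flush()  # close the final chunk
    return chunks


doc = "Intro text before any header.\n\n# Title\nBody under the title.\n"
for chunk in split_by_headers(doc):
    print(chunk)
```

With this input the first chunk holds the introductory text and carries empty metadata, which is the result the report expects from Preview Chunk.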

tjroamer avatar Jun 08 '25 11:06 tjroamer

The markdown chunking logic in Dify is designed to preserve text before the first header (hash sign) as a chunk with empty metadata. If this pre-header text is missing from your chunks, it is likely being filtered out later in the pipeline, often by metadata-based filtering that excludes chunks with empty metadata fields. This behavior is intentional and configurable: if your dataset or retrieval settings require a metadata field (like "Section") to be "not empty," any chunk without that metadata, including pre-header content, is ignored in results. To include the introductory text, review your metadata filtering settings and make sure they do not exclude chunks with empty metadata.
You can see the relevant chunking and filtering logic in the code here:

  • MarkdownHeaderTextSplitter implementation
  • Metadata filtering in retrieval
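As a rough illustration of the filtering effect described above, the sketch below applies a "not empty" condition on a metadata field to a list of chunks. It assumes chunks are plain dictionaries with `content` and `metadata` keys; `filter_not_empty` is a hypothetical helper and not the linked Dify code.

```python
def filter_not_empty(chunks: list[dict], field: str) -> list[dict]:
    """Keep only chunks whose metadata has a non-empty value for `field`.

    Hypothetical helper: models a "field is not empty" retrieval condition.
    """
    return [c for c in chunks if c.get("metadata", {}).get(field)]


chunks = [
    # Pre-header text: no header was seen yet, so the metadata is empty.
    {"content": "Intro text before any header.", "metadata": {}},
    {"content": "Body under the title.", "metadata": {"Section": "Title"}},
]

kept = filter_not_empty(chunks, "Section")
print([c["content"] for c in kept])
```

Under this assumption the pre-header chunk is dropped because it has no `Section` value, while the chunk under the header survives. Removing the "not empty" condition, or not filtering on that field, would keep both.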

dosubot[bot] avatar Jun 08 '25 12:06 dosubot[bot]

same issue.

suncle1993 avatar Jun 30 '25 09:06 suncle1993

Hi, @tjroamer. I'm Dosu, and I'm helping the Dify team manage their backlog and am marking this issue as stale.

Issue Summary:

  • You reported that in Dify v1.4.1 self-hosted, text before the first markdown header (#) was ignored during chunking.
  • The chunking logic does preserve this pre-header text as chunks with empty metadata.
  • These chunks may be filtered out later due to metadata-based filtering settings.
  • Adjusting the metadata filtering to include chunks with empty metadata resolves the issue.
  • Another user, suncle1993, confirmed experiencing the same behavior.

Next Steps:

  • Please confirm if this issue is still relevant with the latest version of Dify.
  • If it is, feel free to keep the discussion open by commenting; otherwise, I will automatically close this issue in 15 days.

Thanks for your understanding and contribution!

dosubot[bot] avatar Aug 28 '25 16:08 dosubot[bot]