haystack icon indicating copy to clipboard operation
haystack copied to clipboard

feat: MarkdownHeaderSplitter

Open OGuggenbuehl opened this issue 5 months ago • 11 comments

Proposed Changes:

Implement MarkdownHeaderSplitter to split Documents written in .md based on their headers

How did you test it?

unit tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

OGuggenbuehl avatar Jul 29 '25 13:07 OGuggenbuehl

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Jul 29 '25 13:07 CLAassistant

@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to further review I'd appreciate if you could add a set of tests like the ones we have for the DocumentSplitter https://github.com/deepset-ai/haystack/blob/main/test/components/preprocessors/test_document_splitter.py

This will help me be able to review the actual algorithm for splitting since it's easier to understand with examples.

sjrl avatar Aug 19 '25 11:08 sjrl

Thanks for your continued work on this @OGuggenbuehl!

Some general comments. Could you:

  • Add a release note for this PR following the instructions here
  • Could you make sure to include our license header to the beginning of each file you've added. You can find an example of the license header here
  • Please make sure to sign the CLA agreement (docs about it here) from this comment
  • If you haven't already please also set up pre-commit hooks using pre-commit install. You can find more info about that in this section of our contribution guidelines.
  • Also in the future feel free to open branches directly in Haystack instead of using a fork. This makes it slightly easier to pull down your code to review locally.

sjrl avatar Sep 18 '25 06:09 sjrl

Pull Request Test Coverage Report for Build 19816136481

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage remained the same at 92.189%

Files with Coverage Reduction New Missed Lines %
core/pipeline/async_pipeline.py 3 65.88%
<!-- Total: 3
Totals Coverage Status
Change from base Build 19775495543: 0.0%
Covered Lines: 14174
Relevant Lines: 15375

💛 - Coveralls

coveralls avatar Sep 19 '25 15:09 coveralls

Thank you for your review, @sjrl ! I resolved the remaining comments and turned this from draft to proper PR. note: pre-commit added some fixes to pre-existing files I did not touch otherwise - not sure if this is wanted

OGuggenbuehl avatar Sep 19 '25 16:09 OGuggenbuehl

@sjrl I have been thinking about whether keeping _infer_header_levels as a method makes sense for this. it's only useful in certain cases, the algorithm does not perfectly recreate document structure in all cases and it is a distinct concern from the component as a whole. do you think it makes sense to keep it in or move it to a more experimental / internal repo instead?

OGuggenbuehl avatar Sep 27 '25 10:09 OGuggenbuehl

@sjrl I have been thinking about whether keeping _infer_header_levels as a method makes sense for this. it's only useful in certain cases, the algorithm does not perfectly recreate document structure in all cases and it is a distinct concern from the component as a whole. do you think it makes sense to keep it in or move it to a more experimental / internal repo instead?

Good question! It does sound different from what our other splitters do and almost fits better into the DocumentCleaner abstraction that we have. If you find that you don't use it often and/or that it doesn't always work well I could see it as a separate component that we could add to https://github.com/deepset-ai/haystack-experimental first so we can easily make breaking changes to it and gather feedback before fully committing to it. What do you think?

sjrl avatar Sep 29 '25 06:09 sjrl

@sjrl I have been thinking about whether keeping _infer_header_levels as a method makes sense for this. it's only useful in certain cases, the algorithm does not perfectly recreate document structure in all cases and it is a distinct concern from the component as a whole. do you think it makes sense to keep it in or move it to a more experimental / internal repo instead?

Good question! It does sound different from what our other splitters do and almost fits better into the DocumentCleaner abstraction that we have. If you find that you don't use it often and/or that it doesn't always work well I could see it as a separate component that we could add to https://github.com/deepset-ai/haystack-experimental first so we can easily make breaking changes to it and gather feedback before fully committing to it. What do you think?

I feel that this would be a more appropriate approach - also because it would improve the separation of concerns by component. I suggest I:

  • remove the header-level inference functionality from this component
  • re-implement it as its own component and put up a PR in haystack-experimental

OGuggenbuehl avatar Sep 29 '25 08:09 OGuggenbuehl

@OGuggenbuehl that sounds good!

sjrl avatar Sep 29 '25 08:09 sjrl

@sjrl done! MarkdownHeaderLevelsInferrer now lives here and is entirely removed from this PR

OGuggenbuehl avatar Sep 29 '25 14:09 OGuggenbuehl

@OGuggenbuehl is attempting to deploy a commit to the deepset Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar Oct 21 '25 08:10 vercel[bot]